Performance Analysis and Modeling of the SciDAC MILC Code on Four Large-scale Clusters

Xingfu Wu and Valerie Taylor
Department of Computer Science, Texas A&M University
{wuxf, taylor}@cs.tamu.edu

Abstract

The MIMD Lattice Computation (MILC) code (version 7.4.0) is a set of codes developed by the MIMD Lattice Computation collaboration for doing simulations of four-dimensional SU(3) lattice gauge theory on MIMD parallel machines. In this paper, we analyze the performance of the MILC code on four large-scale cluster systems: UNC RENCI IBM BlueGene/L, SDSC DataStar, NERSC Seaborg, and NERSC Jacquard. We discuss the scalability of the MILC code under processor and problem scaling, use processor partitioning to investigate the performance impact of the application components, and identify possible performance optimizations for further work. We then use Prophesy to generate performance models for MILC online on each platform, so that these models can be used to predict performance on larger numbers of processors.

1. MILC code

The MIMD Lattice Computation (MILC) code (version 7.4.0) [1] is a set of codes written in C, developed by the MIMD Lattice Computation collaboration for doing simulations of four-dimensional SU(3) lattice gauge theory on MIMD parallel machines. The MILC code is one of the application packages that use the SciDAC modules [2] shown in Figure 1. It uses five of the SciDAC software modules: QMP, QLA, QIO, QDP/C, and QOPQDP. QMP (QCD Message Passing) provides a standard communications layer for lattice QCD based on MPI. QLA (QCD Linear Algebra) provides a standard interface for linear algebra routines that can act on a site or an array of sites with several indexing options. QIO (QCD Input/Output) provides a suite of input/output routines for lattice data. QDP/C is the C implementation of the QDP (QCD Data Parallel) interface. QOPQDP is an implementation of the QOP (QCD Operations) level-three interface using QDP.

Figure 1. The SciDAC layers and the software module architecture [2]

In this paper, we focus on simulations with dynamical Kogut-Susskind fermions with improved actions. The program su3_rmd uses the refreshed molecular dynamics algorithm with the Symanzik 1-loop gauge action and the Asqtad quark action.

2. Experiment Platforms

In this section, we describe the specifications of four large-scale cluster systems: the Linux cluster Jacquard [3] at NERSC (National Energy Research Scientific Computing Center), the IBM BlueGene/L system [4] at RENCI at the University of North Carolina, the IBM POWER3 SMP cluster Seaborg [5] at NERSC, and the IBM POWER4 SMP cluster DataStar [6] at SDSC (San Diego Supercomputer Center).

NERSC Jacquard has 356 dual-processor nodes available for scientific calculations; an additional 8 spare nodes are placed in service if available. Four nodes are dedicated as login nodes, and there are additional I/O and service nodes. Each processor runs at a clock speed of 2.2GHz and has a theoretical peak performance of 4.4 GFlop/s. Processors on each node share 6GB of memory. The nodes are interconnected with a high-speed InfiniBand network, and shared file storage is provided by a GPFS file system.

UNC RENCI BlueGene/L is designed to achieve high performance at low cost and with low power consumption. The BG/L has the following configuration:
Compute - dual 700 MHz PowerPC 440 nodes with 1GB of memory per node
Storage - 11 TB of cluster-wide disk storage
Network - IBM BlueGene Torus, Global Tree network (2.1 GB/s), and Global Interrupt networks

Table 1. Specifications of the four large-scale clusters

System Name     | Jacquard           | RENCI BlueGene/L       | DataStar          | Seaborg
Number of Nodes | 356                |                        | 272               | 380
CPUs per Node   | 2                  | 2                      | 8                 | 16
CPU type        | 2.2GHz AMD Opteron | 700MHz IBM PowerPC 440 | 1.5/1.7GHz POWER4 | 375MHz POWER3
Memory per Node | 6GB                | 1GB                    | 16-32GB           | 16-64GB
Network         | InfiniBand         | Torus                  | Federation        | Colony
OS              | Linux              | Linux                  | AIX               | AIX

The NERSC Seaborg is a distributed-memory computer with 6,080 processors available to run scientific computing applications. The processors are distributed among 380 compute nodes with 16 processors per SMP node. Processors on each node share a memory pool of between 16 and 64 GB (312 nodes with 16GB, 64 nodes with 32GB, and 4 nodes with 64GB). The compute nodes are connected by the IBM Colony switch. The use of the 16-way nodes is exclusive: only one user is allowed to use a node at any given time, regardless of the number of CPUs needed on that node.

DataStar is SDSC's largest IBM terascale machine. It has 176 8-way P655 compute nodes with 1.5GHz POWER4 processors and 16 GB of memory per node, and 96 8-way compute nodes with 1.7GHz POWER4 processors and 32 GB of memory per node. The nodes are connected by the IBM Federation switch. The use of the 8-way nodes is exclusive: only one user is allowed to use a node at any given time, regardless of the number of CPUs needed on that node.

3. Performance Analysis

In this section, we use only su3_rmd to analyze the performance of the MILC code on the four large-scale cluster systems; in particular, we discuss the scalability of the MILC code on up to 2,048 processors. We use Prophesy [7, 8] to instrument the code and collect performance data for our analysis. We ported this code and its libraries to SDSC DataStar and NERSC Seaborg. UNC RENCI BlueGene/L has 2 processors per node, NERSC Jacquard has 2 processors per node, SDSC DataStar has 8 processors per node, and NERSC Seaborg has 16 processors per node. The performance results in this section were obtained using all processors on each node. Note that the total execution time of the application is measured from the beginning of the program to the end and therefore includes I/O time.

3.1 Datasets for Problem and Processor Scaling

The problem sizes for the MILC code are listed in Table 2 (Input-8) and Table 3 (Input-10), where nx, ny, nz, and nt are the sizes of the X, Y, Z, and T dimensions of the grid.

Table 2. Dataset Input-8 with scaling the number of processors (columns: #Procs, nx, ny, nz, nt)

Table 3. Dataset Input-10 with scaling the number of processors (columns: #Procs, nx, ny, nz, nt)

For the problem dataset Input-8, the workload per processor is 8x8x8x8. For the problem dataset Input-10, the workload per processor is 10x10x10x10.

3.2 Weak Scaling

In this section, we use the execution time on 8 processors as a baseline and define the relative slowdown as the execution time on p processors divided by the execution time on 8 processors. Given a fixed workload per processor, the relative slowdown quantifies how much the application performance degrades as the number of processors increases.

1) Problem size (Input-8): 8x8x8x8 per processor

Figure 2. Performance comparisons for the MILC code with Input-8 on the four clusters (runtime in seconds versus number of processors for BlueGene/L, Jacquard, DataStar, and Seaborg)

Figure 3. Relative slowdowns for the MILC code with Input-8 on the four clusters (relative slowdown versus number of processors)

2) Problem size (Input-10): 10x10x10x10 per processor

Figure 4. Performance comparisons for the MILC code with Input-10 on the four clusters (runtime in seconds versus number of processors)

Figure 5. Relative slowdowns for the MILC code with Input-10 on the four clusters (relative slowdown versus number of processors)

Overall, the MILC code has the best weak-scaling performance on SDSC DataStar. When the workload per processor is increased from 8x8x8x8 to 10x10x10x10, the relative slowdown on BlueGene/L becomes the largest beyond 64 processors because of its small memory per node (1GB). The execution time of MILC on Jacquard increases significantly at 128 processors and beyond.

3.3 Strong Scaling

Using a fixed problem size of 32x32x32x32, we discuss the scalability of the MILC code on up to 2,048 processors.
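The scalability metrics behind Figures 2 through 6 are simple ratios of measured runtimes. The short C sketch below (not part of the MILC distribution) shows how they are computed; the runtimes are hypothetical placeholders rather than the measured values, and the relative-speedup definition used for Section 3.3 is our assumption, since only the relative slowdown is defined explicitly above.

/* Minimal sketch of the scalability metrics used in Sections 3.2 and 3.3.
 * Relative slowdown = T(p)/T(8) for a fixed workload per processor;
 * relative speedup is assumed here to be T(8)/T(p) for the fixed problem.
 * All runtimes below are hypothetical placeholders. */
#include <stdio.h>

int main(void)
{
    const int    procs[]       = { 8, 64, 256, 2048 };
    const double weak_time[]   = { 240.0, 255.0, 275.0, 310.0 };  /* weak scaling, hypothetical   */
    const double strong_time[] = { 900.0, 130.0,  40.0,   9.0 };  /* strong scaling, hypothetical */
    const int    n             = sizeof(procs) / sizeof(procs[0]);

    for (int i = 0; i < n; i++) {
        double slowdown = weak_time[i]   / weak_time[0];    /* Section 3.2 metric            */
        double speedup  = strong_time[0] / strong_time[i];  /* Section 3.3 metric (assumed)  */
        printf("p=%5d  relative slowdown=%.2f  relative speedup=%.1f (ideal %.1f)\n",
               procs[i], slowdown, speedup, procs[i] / 8.0);
    }
    return 0;
}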

Figure 6. Relative speedups for the MILC code with the fixed problem size of 32x32x32x32 on the four clusters (speedup on a log scale versus number of processors, with the ideal curve)

Overall, the MILC code has the best strong-scaling performance on RENCI BlueGene/L. Although BlueGene/L has the smallest memory per node (i.e., 1GB), for the fixed problem size of 32x32x32x32 the workload per processor decreases significantly as the total number of processors increases, and this benefits BlueGene/L.

3.4 Application Sensitivity Analysis Using Processor Partitioning

We define processor partitioning as the number of processors per node used for an application execution. A processor partitioning scheme NxM means N nodes with M processors per node. We can use processor partitioning to quantify the time differences among different processor partitioning schemes (PPS) and to investigate how sensitive the application and its components are to the different communication and memory access patterns of different PPS. For example, given the PPS NxM, when M=1 the communication pattern is internode only and a single processor on each node uses the whole memory; when M > 1 there are both internode and intranode communications, and the M processors on each node compete for the node's memory.

In this section, we investigate how processor partitioning impacts the performance of the application and its components. As an example, we run MILC 7.4 with Input-10 (the large problem size of 40x40x40x40) on 256 processors on the four large-scale clusters. Runtime means the total application execution time, which includes I/O time; update() is one function of the MILC code; and Others refers to all components of the MILC code other than update().
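To make the PPS concrete, the stand-alone MPI sketch below (not part of MILC; the hostname comparison and buffer length are our assumptions) reports how many ranks share a node, which is a quick way to verify the NxM scheme actually granted by the batch system before a run.

/* Report how many MPI ranks share this node, i.e. the M in the PPS NxM. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define HOST_LEN 64

int main(int argc, char **argv)
{
    int rank, size, i, local = 0;
    char myhost[HOST_LEN];
    char *allhosts;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    gethostname(myhost, HOST_LEN);
    myhost[HOST_LEN - 1] = '\0';

    /* gather every rank's hostname and count the ranks that share ours */
    allhosts = (char *) malloc((size_t) size * HOST_LEN);
    MPI_Allgather(myhost, HOST_LEN, MPI_CHAR, allhosts, HOST_LEN, MPI_CHAR,
                  MPI_COMM_WORLD);
    for (i = 0; i < size; i++)
        if (strncmp(myhost, allhosts + (size_t) i * HOST_LEN, HOST_LEN) == 0)
            local++;

    if (rank == 0)
        printf("PPS check: %d total MPI ranks, M = %d ranks on node %s\n",
               size, local, myhost);

    free(allhosts);
    MPI_Finalize();
    return 0;
}

Under a 256x1 scheme the reported M is 1 and all communication is internode; under 128x2, 32x8, or 16x16 the reported M matches the second factor of the scheme.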

Figure 7. Performance comparison for different PPS on 256 processors on UNC RENCI BlueGene/L (runtime in seconds for Runtime, update(), and Others under the 256x1 and 128x2 schemes)

Figure 7 shows the performance comparison for different PPS on 256 processors on UNC RENCI BlueGene/L. The time difference between the schemes 256x1 and 128x2 is 24.61%.

Figure 8. Performance comparison for different PPS on 256 processors on NERSC Jacquard (runtime in seconds for Runtime, update(), and Others under the 256x1 and 128x2 schemes)

Figure 8 shows the performance comparison for different PPS on 256 processors on NERSC Jacquard. The time difference between the schemes 256x1 and 128x2 is 19.92%.

Figure 9. Performance comparison for different PPS on 256 processors on NERSC Seaborg (runtime in seconds for Runtime, update(), and Others under the 256x1, 128x2, 64x4, 32x8, and 16x16 schemes)

Figure 9 shows the performance comparison for different PPS on 256 processors on NERSC Seaborg. The time difference between the schemes 256x1 and 16x16 is 15.25%.

Figure 10. Performance comparison for different PPS on 256 processors on SDSC DataStar (runtime in seconds for Runtime, update(), and Others under the 256x1, 128x2, 64x4, and 32x8 schemes)

Figure 10 shows the performance comparison for different PPS on 256 processors on SDSC DataStar. The time difference between the schemes 256x1 and 32x8 is 16.35%.

The function update() accounts for the time differences among the different processor partitioning schemes. It is dominated by the functions update_h(), ks_congrad_two_src(), and update_u(), and there is a large doubly nested loop in update_u(); for example, the number of iterations is 16 for Input-10, with a different count for Input-8.

4. Possible Performance Optimization

The dominant functions within update() are update_h(), ks_congrad_two_src(), and update_u().
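The optimization proposed in the remainder of this section is loop (cache) blocking. As a generic, hedged illustration of that transformation only, and not the actual update_u() kernel, array shapes, or block size used in MILC, consider the following sketch.

/* Generic illustration of cache blocking (loop tiling).  N, BS, and the
 * transpose-accumulate body are assumptions chosen only to show the
 * transformation; they are not taken from the MILC source. */
#include <stddef.h>

#define N  1024
#define BS   64   /* block size, to be tuned to the cache hierarchy */

/* Original form: the column-wise access to b[][] streams through memory
 * and reuses little of what is already in cache. */
void kernel_unblocked(double a[N][N], const double b[N][N])
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            a[i][j] += 2.0 * b[j][i];
}

/* Blocked form: the same iteration space is traversed in BSxBS tiles,
 * so each tile of b[][] stays resident in cache while it is reused.
 * (N is assumed divisible by BS; callers would allocate a and b on the heap.) */
void kernel_blocked(double a[N][N], const double b[N][N])
{
    for (size_t ii = 0; ii < N; ii += BS)
        for (size_t jj = 0; jj < N; jj += BS)
            for (size_t i = ii; i < ii + BS; i++)
                for (size_t j = jj; j < jj + BS; j++)
                    a[i][j] += 2.0 * b[j][i];
}

The best block size differs across the four platforms because of their different cache sizes, which is why the blocking would have to be retuned per system.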

Use the loop blocking technique to optimize them: first try to optimize update_u(), and then consider how to optimize update_h() and ks_congrad_two_src(), because the major function calls in those two functions come from the QOPQDP library.

Two optimization strategies for further work:
Cache blocking at the source code level
Prefetching at compile time (suggested by Carleton DeTar)

5. Performance Modeling Using the Prophesy System

The Prophesy system [8] provides PAIDE, an automatic performance instrumentation and data entry tool [7]. PAIDE provides automatic instrumentation of codes at the level of basic blocks, procedures, or functions. The default mode instruments the entire code at the level of basic loops and procedures. A user can specify that the code be instrumented at a finer granularity than loops, or identify the particular events to be instrumented. The resulting performance data is automatically uploaded into the Prophesy database by PAIDE at the end of the application execution, and is used to produce performance models and predict application performance via the Prophesy web-based interfaces [9].

For the MILC code with Input-10, we use Prophesy to illustrate the modeling process as follows. Figures 11 through 13 show the performance models and the performance predicted on 4,096 processors based on the performance data from 8 to 2,048 processors.

Figure 11. Performance model for MILC with Input-10 on RENCI BlueGene/L

Figure 12. Performance model for MILC with Input-10 on SDSC DataStar

Figure 13. Performance model for MILC with Input-10 on NERSC Seaborg

Figures 11 and 12 show more accurate performance models on BlueGene/L and DataStar, where the norm of residuals is less than 2 (about 2% of the total execution times). The performance model in Figure 13, however, is not accurate: its norm of residuals amounts to 8.45% of the total execution time on 8 processors.
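To make the modeling step concrete, the sketch below fits a least-squares curve to runtime-versus-processor-count data and extrapolates it to 4,096 processors, in the spirit of Figures 11 through 13. The model form (a quadratic in log2(p)) and the data points are assumptions for illustration only; Prophesy selects the actual model terms and coefficients from the measured data in its database.

/* Hedged sketch of a least-squares performance model: fit
 * T(p) = c0 + c1*x + c2*x^2 with x = log2(p) to hypothetical runtimes on
 * 8..2048 processors, then predict the runtime on 4096 processors. */
#include <math.h>
#include <stdio.h>

#define NPTS 9

static double det3(const double m[3][3])
{
    return m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
         - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
         + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]);
}

int main(void)
{
    /* hypothetical (processors, runtime in seconds) pairs */
    const double p[NPTS] = { 8, 16, 32, 64, 128, 256, 512, 1024, 2048 };
    const double t[NPTS] = { 420, 428, 437, 450, 462, 478, 493, 512, 530 };

    double A[3][3] = { { 0 } }, b[3] = { 0 }, c[3];
    int i, r, s;

    /* build the normal equations for the basis {1, x, x^2}, x = log2(p) */
    for (i = 0; i < NPTS; i++) {
        double x = log2(p[i]);
        double phi[3] = { 1.0, x, x * x };
        for (r = 0; r < 3; r++) {
            for (s = 0; s < 3; s++)
                A[r][s] += phi[r] * phi[s];
            b[r] += phi[r] * t[i];
        }
    }

    /* solve the 3x3 system with Cramer's rule */
    double d = det3(A);
    for (r = 0; r < 3; r++) {
        double M[3][3];
        for (i = 0; i < 3; i++)
            for (s = 0; s < 3; s++)
                M[i][s] = (s == r) ? b[i] : A[i][s];
        c[r] = det3(M) / d;
    }

    double x = log2(4096.0);
    printf("fit: T(p) = %.2f + %.2f*log2(p) + %.3f*log2(p)^2\n", c[0], c[1], c[2]);
    printf("predicted runtime on 4096 processors: %.1f seconds\n",
           c[0] + c[1] * x + c[2] * x * x);
    return 0;
}

The norm of the residuals of such a fit is the accuracy measure quoted above for Figures 11 through 13.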

Figure 14. Application-level performance data stored in the Prophesy database

Figure 15. Function-level performance data stored in the Prophesy database

Figure 14 shows a summary of the application-level performance data stored in the Prophesy database, and Figure 15 shows a summary of the function-level performance data stored in the Prophesy database.

Acknowledgements

We would like to thank Ying Zhang from UNC RENCI for providing the initial MILC code and datasets, and Carleton DeTar from the University of Utah and Steven Gottlieb from Indiana University for their help in understanding the code and the results.

Summary

We have carried out the following tasks:
Compared the performance of the MILC code on four different clusters
Discussed which platform performs best for the MILC code, in other words, which platform is best suited to the SciDAC MILC application
Investigated how processor partitioning impacts the performance of MILC in order to identify possible performance bottlenecks for further work
Modeled the performance of MILC and predicted the performance on a larger number of processors

References

[1] The MIMD Lattice Computation (MILC) Collaboration code.
[2] US Lattice Quantum Chromodynamics.
[3] NERSC Jacquard.
[4] UNC RENCI BlueGene/L.
[5] NERSC Seaborg, resources/sp/.
[6] SDSC DataStar.
[7] Xingfu Wu, Valerie Taylor, and Rick Stevens, "Design and Implementation of Prophesy Automatic Instrumentation and Data Entry System," Proc. of the 13th IASTED International Conference on Parallel and Distributed Computing and Systems (PDCS 2001), Anaheim, CA, August 2001.
[8] Valerie Taylor, Xingfu Wu, and Rick Stevens, "Prophesy: An Infrastructure for Performance Analysis and Modeling of Parallel and Grid Applications," ACM SIGMETRICS Performance Evaluation Review, Volume 30, Issue 4, March 2003.
[9] Xingfu Wu, Valerie Taylor, and Joseph Paris, "A Web-based Prophesy Automated Performance Modeling System," IASTED International Conference on Web Technologies, Applications and Services (WTAS 2006), July 17-19, 2006, Calgary, Canada.
