Scheduling on Asymmetric Parallel Architectures


Scheduling on Asymmetric Parallel Architectures

Filip Blagojevic

Dissertation submitted to the faculty of the Virginia Polytechnic Institute and State University in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science and Applications

Committee Members:
Dimitrios S. Nikolopoulos (Chair)
Kirk W. Cameron
Wu-chun Feng
David K. Lowenthal
Calvin J. Ribbens

May 30, 2008
Blacksburg, Virginia

Keywords: Multicore processors, Cell BE, process scheduling, high-performance computing, performance prediction, runtime adaptation

© Copyright 2008, Filip Blagojevic

Scheduling on Asymmetric Parallel Architectures

Filip Blagojevic

(ABSTRACT)

We explore runtime mechanisms and policies for scheduling dynamic multi-grain parallelism on heterogeneous multi-core processors. Heterogeneous multi-core processors integrate conventional cores that run legacy codes with specialized cores that serve as computational accelerators. The term multi-grain parallelism refers to the exposure of multiple dimensions of parallelism from within the runtime system, so as to best exploit a parallel architecture with heterogeneous computational capabilities between its cores and execution units. To maximize performance on heterogeneous multi-core processors, programs need to expose multiple dimensions of parallelism simultaneously. Unfortunately, programming with multiple dimensions of parallelism is to date an ad hoc process, relying heavily on the intuition and skill of programmers. Formal techniques are needed to optimize multi-dimensional parallel program designs. We investigate user- and kernel-level schedulers that dynamically rightsize the dimensions and degrees of parallelism on asymmetric parallel platforms. The schedulers address the problem of mapping application-specific concurrency to an architecture with multiple hardware layers of parallelism, without requiring programmer intervention or sophisticated compiler support. Our runtime environment outperforms the native Linux and MPI scheduling environment by up to a factor of 2.7. We also present a model of multi-dimensional parallel computation for steering the parallelization process on heterogeneous multi-core processors. The model predicts with high accuracy the execution time and scalability of a program using conventional processors and accelerators simultaneously. More specifically, the model reveals optimal degrees of multi-dimensional, task-level and data-level concurrency, to maximize performance across cores. We evaluate our runtime policies, as well as the performance model we developed, on an IBM Cell BladeCenter and on a cluster composed of PlayStation 3 nodes, using two realistic bioinformatics applications.

ACKNOWLEDGMENTS

I would like to thank my advisor, Dr. Dimitrios S. Nikolopoulos, for his guidance during my graduate studies. I would also like to thank Dr. Alexandros Stamatakis, Dr. Xizhou Feng, and Dr. Kirk Cameron for providing us with the original MPI implementations of PBPI and RAxML and for discussions on scheduling and modeling the Cell/BE. I would like to thank the members of the PEARL group, Dr. Christos Antonopoulos, Dr. Matthew Curtis-Maury, Scott Schneider, Jae-Sung Yeom, and Benjamin Rose, for their involvement in the projects presented in this dissertation. I would also like to thank my Ph.D. committee for their discussion and suggestions for this work: Dr. Kirk W. Cameron, Dr. David Lowenthal, Dr. Wu-chun Feng, and Dr. Calvin J. Ribbens. Also, I thank Georgia Tech, its Sony-Toshiba-IBM Center of Competence, and NSF, for the Cell/BE resources that have contributed to this research. Finally, I would like to thank the institutions that have funded this research: the National Science Foundation and the U.S. Department of Energy.


Contents

1 Problem Statement
    Mapping Parallelism to Asymmetric Parallel Architectures

2 Statement of Objectives
    Dynamic Multigrain Parallelism
    Rightsizing Multigrain Parallelism
    MMGP Model

3 Experimental Testbed
    RAxML
    PBPI
    Hardware Platform

4 Code Optimization Methodologies for Asymmetric Multi-core Systems with Explicitly Managed Memories
    Porting and Optimizing RAxML on Cell
    Function Off-loading
    Optimizing Off-Loaded Functions
    Vectorizing Conditional Statements
    Double Buffering and Memory Management
    Vectorization
    PPE-SPE Communication
    Increasing the Coverage of Offloading
    Parallel Execution
    Chapter Summary

5 Scheduling Multigrain Parallelism on Asymmetric Systems
    Introduction
    Scheduling Multi-Grain Parallelism on Cell
    Event-Driven Task Scheduling
    Scheduling Loop-Level Parallelism
    Implementing Loop-Level Parallelism
    Dynamic Scheduling of Task- and Loop-Level Parallelism
    Application-Specific Hybrid Parallelization on Cell
    MGPS
    S-MGPS
    Motivating Example
    Sampling-Based Scheduler for Multi-grain Parallelism
    Chapter Summary

6 Model of Multi-Grain Parallelism
    Introduction
    Modeling Abstractions
    Hardware Abstraction
    Application Abstraction
    Model of Multi-grain Parallelism
    Modeling sequential execution
    Modeling parallel execution on APUs
    Modeling parallel execution on HPUs
    Using MMGP
    MMGP Extensions
    Experimental Validation and Results
    MMGP Parameter approximation
    Case Study I: Using MMGP to parallelize PBPI
    Case Study II: Using MMGP to Parallelize RAxML
    MMGP Usability Study
    Chapter Summary

7 Scheduling Asymmetric Parallelism on a PS3 Cluster
    Introduction
    Experimental Platform
    PS3 Cluster Scalability Study
    MPI Communication Performance
    Application Benchmarks
    Modeling Hybrid Parallelism
    Modeling PPE Execution Time
    Modeling the off-loaded Computation
    DMA Modeling
    Cluster Execution Modeling
    Verification
    Co-Scheduling on Asymmetric Clusters
    PS3 versus IBM QS20 Blades
    Chapter Summary

8 Kernel-Level Scheduling
    Introduction
    SLED Scheduler Overview
    Ready-to-run List
    Ready-to-run List Organization
    Splitting the Ready-to-run List
    SLED Scheduler - Kernel Level
    SLED Scheduler - User Level
    Experimental Setup
    Benchmarks
    Microbenchmarks
    PBPI
    RAxML
    Chapter Summary

9 Future Work
    Integrating the ready-to-run list in the Kernel
    Load Balancing and Task Priorities
    Increasing Processor Utilization
    Novel Applications and Programming Models
    Conventional Architectures
    MMGP extensions

10 Overview of Related Research
    Cell Related Research
    Process Scheduling - Related Research
    Modeling Related Research
    PRAM Model
    BSP Model
    LogP Model
    Models Describing Nested Parallelism

Bibliography

9 List of Figures 2.1 A hardware abstraction of an accelerator-based architecture. Host processing units (HPUs) supply coarse-grain parallel computation across accelerators. Accelerator processing units (APUs) are the main computation engines and may support internally finer grain parallelism Organization of Cell The likelihood vector structure is used in almost all memory traffic between main memory and the local storage of the SPEs. The structure is 128-bit aligned, as required by the Cell architecture The body of the first loop in newview(): a) Non vectorized code, b) Vectorized code The second loop in newview(). Non vectorized code shown on the left, vectorized code shown on the right. spu madd() multiplies the first two arguments and adds the result to the third argument. spu splats() creates a vector by replicating a scalar element Performance of (a) RAxML and (b) PBPI with different number of MPI processes Scheduler behavior for two off-loaded tasks, representative of RAxML. Case (a) illustrates the behavior of the EDTLP scheduler. Case (b) illustrates the behavior of the Linux scheduler with the same workload. The numbers correspond to MPI processes. The shaded slots indicate context switching. The example assumes a Cell-like system with four SPEs Parallelizing a loop across SPEs using a work-sharing model with an SPE designated as the master ix

10 5.3 The data structure Pass is used for communication among SPEs. The v i ad variables are used to pass input arguments for the loop body from one local storage to another. The variable sig is used as a notification signal that the memory transfer for the shared data updated during the loop is completed. The variable res is used to send results back to the master SPE, and as a dependence resolution mechanism Parallelization of the loop from function evaluate() in RAxML. The left side depitcs the code executed by the master SPE, while the right side depitcs the code executed by a worker SPE. Num SPE represents the number of SPE worker threads Comparison of task-level and hybrid parallelization schemes in RAxML, on the Cell BE. The input file is 42 SC. The number of ML trees created is (a) 1 16, (b) MGPS, EDTLP and static EDTLP-LLP. Input file: 42 SC. Number of ML trees created: (a) 1 16, (b) Execution time of RAxML with a variable number of SPE threads. The input dataset is 25 SC Execution times of RAxML, with various static multi-grain scheduling strategies. The input dataset is 25 SC The sampling phase of S-MGPS. Samples are taken from four execution intervals, during which the code performs identical operations. For each sample, each MPI process uses a variable number of SPEs to parallelize its enclosed loops PBPI executed with different levels of TLP and LLP parallelism: deg(tlp)=1-4, deg(llp)= A hardware abstraction of an accelerator-based architecture with two layers of parallelism. Host processing units (HPUs) relatively supply coarse-grain parallel computation across accelerators. Accelerator processing units (APUs) are the main computation engines and may support internally finer grain parallelism. Both HPUs and APUs have local memories and communicate through shared-memory or message-passing. Additional layers of parallelism can be expressed hierarchically in a similar fashion x

11 6.2 Our application abstraction of two parallel tasks. Two tasks are spawned by the main process. Each task exhibits phased, multi-level parallelism of varying granularity. In this paper, we address the problem of mapping tasks and subtasks to accelerator-based systems The sub-phases of a sequential application are readily mapped to HPUs and APUs. In this example, sub-phases 1 and 3 execute on the HPU and sub-phase 2 executes on the APU. HPUs and APUs are assumed to communicate via shared memory Parallel APU execution. The HPU (leftmost bar in parts a and b) offloads computations to one APU (part a) and two APUs (part b). The single point-to-point transfer of part a is modeled as overhead plus computation time on the APU. For multiple transfers, there is additional overhead (g), but also benefits due to parallelization Parallel HPU execution. The HPU (center bar) offloads computations to 4 APUs (2 on the right and 2 on the left). The first thread on the HPU offloads computation to APU1 and APU2 then idles. The second HPU thread is switched in, offloads code to APU3 and APU4, and then idles. APU1 and APU2 complete and return data followed by APU3 and APU MMGP predictions and actual execution times of PBPI, when the code uses one dimension of PPE (HPU) parallelism MMGP predictions and actual execution times of PBPI, when the code uses one dimension of SPE (APU) parallelism, with a data-parallel implementation of the maximum likelihood calculation MMGP predictions and actual execution times of PBPI, when the code uses two dimensions of SPE (APU) and PPE (HPU) parallelism. The mix of degrees of parallelism which optimizes performance is 4-way PPE parallelism combined with 4-way SPE parallelism. The chart illustrates the results when both SPE parallelism and PPE parallelism are scaled to two Cell processors MMGP predictions and actual execution times of RAxML, when the code uses one dimension of PPE (HPU) parallelism: (a) with DS1, (b) with DS MMGP predictions and actual execution times of RAxML, when the code uses one dimension of SPE (APU) parallelism: (a) with DS1, (b) with DS MMGP predictions and actual execution times of RAxML, when the code uses two dimensions of SPE (APU) and PPE (HPU) parallelism. Performance is optimized by oversubscribing the PPE and maximizing task-level parallelism.. 82 xi

12 6.12 Overhead of the sampling phase when MMGP scheduler is used with the PBPI application. PBPI is executed multiple times with 107 input species. The sequence size of the input file is varied from 1,000 to 10,000. In the worst case, the overhead of the sampling phase is 2.2% (sequence size 7,000) MPI Allreduce() performance on the PS3 cluster. Processes are distributed evenly between nodes. Each node runs up to 6 processes, using shared memory for communication within the node MPI Send/Recv() latency on the PS3 cluster. Processes are distributed evenly between nodes. Each node runs up to 6 processes, using shared memory for communication within the node Measured and predicted performance of applications on the PS3 cluster. PBPI is executed with weak scaling. RAxML is executed with strong scaling. x-axis notation: N node - number of nodes, N process - number of processes per node, N SP E - number of SPEs per process Four cases illustrating the importance of co-scheduling PPE threads and SPE threads. Threads labeled P are PPE threads, while threads labeled S are SPE threads. We assume that P-threads and S-threads communicate through shared memory. P-threads poll shared memory locations directly to detect if a previously off-loaded S-thread has completed. Striped intervals indicate yielding of the PPE, dark intervals indicate computation leading to a thread off-load on an SPE, light intervals indicate computation yielding the PPE without offloading on an SPE. Stars mark cases of mis-scheduling SPE execution Double buffering template for tiled parallel loops Performance of yield-if-not-ready policy and the native Linux scheduler in PBPI and RAxML. x-axis notation: N node - number of nodes, N process - number of processes per node, N SP E - number of SPEs per process Performance of different scheduling strategies in PBPI and RAxML Comparison between the PS3 cluster and an IBM QS20 cluster Upon completing the assigned tasks, the SPEs send signal to the PPE processes through the ready-to-run list. The PPE process which decides to yield passes the data from the ready-to-run list to the kernel, which in return can schedule the appropriate process on the PPE xii

13 8.2 Vertical overview of the SLED scheduler. The user level part contains the readyto-run list, shared among the processes, while the kernel part contains the system call through which the information from the ready-to-run list is passed to the kernel ProcessP 1, which is bound to CP U 1, needs to be scheduled to run by the scheduler that was invoked on CP U 2. Consequently, the kernel needs to perform migration of the process P 1, from CP U 1 to CP U System call for migrating the processes across the execution contexts. Function sched migrate task() performs the actual migration. SLEDS yield() function schedules the process to be the next to run on the CPU The ready to run list is split in two parts. Each of the two sublists contain processes that are sharing the execution context (CP U 1 or CP U 2 ). This approach avoids any possibility of expensive process migration across the execution contexts Execution flow of the SLEDS yield() function: (a) The appropriate process is found in the running list (tree), (b) The process is pulled out from the list, and its priority is increased, (c) The process is returned to the list, and since its priority is increased it will be stored at the left most position Outline of the SLEDS scheduler: Upon off-loading a process is required to call the SLEDS Offload() function. SLEDS Offload() checks if the off-loaded task has finished (Line 14), and if not, calls the yield() function. yield() scans the ready to run list, and yields to the next process by executing SLEDS yield() system call Execution times of RAxML when the ready to run list is scanned between 50 and 1000 times. x-axis represents the number of scans of the ready to run list. y-axis represents the execution time. Note that the lowest value for the y-axis is 12.5, and the difference between the lowest and the highest execution time is 4.2%. The input file contains 10 species, each represented by 1800 nucleotides Comparison of the EDTLP and SLED schemes using microbenchmarks: Total execution time is measured as the length of the off-loaded tasks is increased Comparison of the EDTLP and SLED schemes using microbenchmarks: Total execution time is measured as the length of the off-loaded tasks is increased task size is limited to 2.1us EDTLP outperforms SLED for small task sizes due to higher complexity of the SLED scheme xiii

14 8.12 Comparison of the EDTLP scheme and the combination of SLED and EDTLP schemes using microbenchmarks. EDTLP is used for the task sizes smaller than 15µs Comparison of the EDTLP scheme and the combination of SLED and EDTLP schemes using microbenchmarks. EDTLP is used for the task sizes smaller than 15µs task size is limited to 2.µs Comparison of EDTLP and SLED schemes using the PBPI application. The application is executed multiple times with varying length of the input sequence (represented on the x-axis) Comparison of EDTLP and the combination of SLED and EDTLP schemes using the PBPI application. The application is executed multiples time with varying length of the input sequence (represented on the x-axis) Comparison of EDTLP and SLED schemes using the RAxML application. The application is executed multiple times with varying length of the input sequence (represented on the x-axis) Comparison of EDTLP and the combination of SLED and RAxML schemes using the RAxML application. The application is executed multiple times with varying length of the input sequence (represented on the x-axis) Upon completing the assigned tasks, SPEs send signals to PPE processes through the ready-to-run list. The PPE process which decides to yield passes the data from the ready-to-run queue to the kernel, which in return can schedule the appropriate process on the PPE xiv

15 List of Tables 4.1 Execution time of RAxML (in seconds). The input file is 42 SC. (a) The whole application is executed on the PPE, (b) newview() is offloaded on one SPE Execution time of RAxML after the floating-point conditional statement is transformed to an integer conditional statement and vectorized. The input file is 42 SC Execution time of RAxML with double buffering applied to overlap DMA transfers with computation. The input file is 42 SC Execution time of RAxML following vectorization. The input file is 42 SC Execution time of RAxML following the optimization of communication to use direct memory-to-memory transfers. The input file is 42 SC Execution time of RAxML after offloading and optimizing three functions: newview(), makenewz() and evaluate(). The input file is 42 SC Performance comparison for (a) RAxML and (b) PBPI with two schedulers. The second column shows execution time with the EDTLP scheduler. The third column shows execution time with the native Linux kernel scheduler. The workload for RAxML contains 42 organisms. The workload for PBPI contains 107 organisms Execution time of RAxML when loop-level parallelism (LLP) is exploited in one bootstrap, via work distribution between SPEs. The input file is 42 SC: (a) DNA sequences are represented with 10,000 nucleotides, (b) DNA sequences are represented with 20,000 nucleotides Execution time of PBPI when loop-level parallelism (LLP) is exploited via work distribution between SPEs. The input file is 107 SC: (a) DNA sequences are represented with 1,000 nucleotides, (b) DNA sequences are represented with 10,000 nucleotides xv

16 5.4 Efficiency of different program configurations with two data sets in RAxML. The best configuration for 42 SC input is deg(tlp)=8, deg(llp)=1. The best configuration for 25 SC is deg(tlp)=4, deg(llp)=2. deg() corresponds the degree of a given dimension of parallelism (LLP or TLP) RAxML Comparison between S-MGPS and static scheduling schemes, illustrating the convergence overhead of S-MGPS PBPI comparison between S-MGPS and static scheduling schemes: (a) deg(tlp)=1, deg(llp)=1 16; (b) deg(tlp)=2, deg(llp)=1 8; (c) deg(tlp)=4, deg(llp)=1 4; (d) deg(tlp)=8, deg(llp)= xvi

Chapter 1 Problem Statement

In the quest for delivering higher performance to scientific applications, hardware designers began to move away from superscalar processor models and embraced architectures with multiple processing cores. Although all commodity microprocessor vendors are marketing multicore processors, these processors are largely based on replication of superscalar cores. Unfortunately, superscalar designs exhibit well-known performance and power limitations. These limitations, in conjunction with a sustained requirement for higher performance, stimulated interest in unconventional processor designs that combine parallelism with acceleration. These designs leverage multiple cores, some of which are customized accelerators for data-intensive computation. Examples of these heterogeneous, accelerator-based parallel architectures are the Cell BE [3], GPGPUs [4], the Rapport KiloCore [2], EXOCHI [96], etc.

As a case study and a representative of accelerator-based asymmetric architectures, in this dissertation we investigate the Cell Broadband Engine (CBE). Cell has recently drawn considerable attention from industry and academia. Since it was originally designed for the game-box market, Cell has low cost and a modest power budget. Nevertheless, the processor is able to achieve unprecedented peak performance for some real-world applications. IBM recently announced the use of Cell chips in RoadRunner, a new Petaflop system with 16,000 Cells, due for delivery in 2008.

The potential of the Cell BE has been demonstrated convincingly in a number of studies [33, 39, 69, 74, 91]. Thanks to eight high-frequency execution cores with pipelined SIMD capabilities, and an aggressive data transfer architecture, Cell has a theoretical peak performance of over 200 Gflops for single-precision FP calculations and a peak memory bandwidth of over 25 Gigabytes/s. These performance figures position Cell well ahead of the most powerful commodity microprocessors. Cell has already demonstrated impressive performance ratings in applications and computational kernels with highly vectorizable data parallelism, such as signal processing, compression, encryption, and dense and sparse numerical kernels [12, 13, 15, 39, 48, 49, 66, 75, 78, 79, 99].

1.1 Mapping Parallelism to Asymmetric Parallel Architectures

Arguably, one of the most difficult problems that programmers face while migrating to a new parallel architecture is the mapping of algorithms and data to the architecture. Accelerator-based multi-core processors complicate this problem in two ways. Firstly, by introducing heterogeneous execution cores, the user needs to be concerned with mapping each component of the application to the type of core that best matches the computational and memory bandwidth demands of the component. Secondly, by providing multiple cores with embedded SIMD or multi-threading capabilities, the user needs to be concerned with extracting multiple dimensions of parallelism from the application and mapping each dimension to parallel execution units, so as to maximize performance.

Cell provides a motivating and timely example for the problem of mapping algorithmic parallelism to modern multi-core architectures. The processor can exploit task and data parallelism, both across and within its cores. On accelerator-based multi-core architectures the programmer must be aware of core heterogeneity, and carefully balance execution between the host and accelerator cores.

Furthermore, the programmer faces a seemingly vast number of options for parallelizing code on these architectures. Functional and data decompositions of the program can be implemented on both the host and the accelerator cores. Functional decompositions can be achieved by dividing functions between the hosts and the accelerators and by off-loading functions from the hosts to accelerators at runtime. Data decompositions are also possible, by using SIMDization on the vector units of the accelerator cores, loop-level parallelization across accelerators, or a combination of loop-level parallelization across accelerators and SIMDization within accelerators.

In this thesis we explore different approaches to automating the mapping of applications to asymmetric parallel architectures. We explore both runtime and static approaches for combining and managing functional and data decomposition. We combine and orchestrate multiple levels of parallelism inside an application in order to achieve both harmonious utilization of all host and accelerator cores and high utilization of the memory bandwidth available on asymmetric multi-core processors. Although we chose Cell as our case study, our scheduling algorithms and decisions are general and can be applied to any asymmetric parallel architecture.


Chapter 2 Statement of Objectives

2.1 Dynamic Multigrain Parallelism

While many studies have focused on performance evaluation and optimization for heterogeneous multi-core architectures [23, 31, 54, 63, 65, 74, 98], the optimal mapping of parallel applications to these architectures has not been investigated. In this thesis we explore heterogeneous multi-core architectures from a different perspective, namely that of multigrain parallelization. Asymmetric parallel architectures have a specific design: they can exploit orthogonal dimensions of task and data parallelism on a single chip. The processor is controlled by one or more host processing elements, which usually schedule the computation off-loaded to accelerator processing units. The accelerators are usually SIMD processors and provide the bulk of the processor's computational power. A general design of heterogeneous, accelerator-based architectures is represented in Figure 2.1.

To simplify programming and improve efficiency on asymmetric parallel architectures, we present a set of dynamic scheduling policies and the associated mechanisms. We introduce an event-driven scheduler, EDTLP, which oversubscribes the host processing cores and exposes dynamic parallelism across accelerators. We also propose MGPS, a scheduling module which controls multi-grain parallelism on the fly to monotonically increase accelerator utilization.
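The basic off-load-and-yield mechanism behind EDTLP can be illustrated with a minimal sketch. The completion flag and the request function below are illustrative placeholders, not the actual runtime interface, which is described in Chapter 5.

/* Illustrative sketch only (not the EDTLP implementation): a host-side MPI
 * process off-loads a function to an SPE and yields the PPE while the
 * off-loaded task runs, so that another oversubscribed process can use the
 * host core. 'task_done' stands for a completion flag that the SPE thread
 * is assumed to set in shared memory. */
#include <sched.h>

extern volatile int task_done;              /* set by the SPE on completion */
extern void send_offload_request(int spe);  /* hypothetical signaling call  */

void offload_and_yield(int spe)
{
    task_done = 0;
    send_offload_request(spe);

    /* Yield instead of busy-waiting, so the PPE can run another process. */
    while (!task_done)
        sched_yield();
}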

Figure 2.1: A hardware abstraction of an accelerator-based architecture. Host processing units (HPUs) supply coarse-grain parallel computation across accelerators. Accelerator processing units (APUs) are the main computation engines and may support internally finer grain parallelism.

MGPS monitors the number of active accelerators used by off-loaded tasks over discrete intervals of execution and makes a prediction on the best combination of dimensions and granularity of parallelism to expose to the hardware. The purpose of these policies is to exploit the proper layers and degrees of parallelism in the application, in order to maximize the efficiency of the processor's computational cores. We explore the design and implementation of our scheduling policies using two real-world scientific applications, RAxML [87] and PBPI [45]. RAxML and PBPI are bioinformatics applications used for generating phylogenetic trees, and we describe them in more detail in Chapter 3.

One of the most efficient execution models on asymmetric parallel architectures, which reduces the idle time on the host processors as well as on the accelerators, is to oversubscribe the host processing unit with multiple processes. In this approach one or more accelerators are assigned to each process for off-loading the expensive computation. Although the off-loading approach enables high utilization of the architecture, it also increases contention and the number of context switches on the host processing unit, as well as the time necessary for a single context switch to complete. To reduce the contention caused by context switching, and the idle time that occurs on the accelerator cores as a consequence, we designed and implemented a slack-minimizer scheduler (SLED). In our case study, the SLED scheduler is capable of improving performance on the Cell processor by up to 17%.

The study of dynamic scheduling strategies makes the following contributions:

- We present a runtime system and scheduling policies that exploit polymorphic (task- and loop-level) parallelism on asymmetric parallel processors. Our runtime system is adaptive, in the sense that it chooses the form and degree of parallelism to expose to the hardware in response to workload characteristics. Since the right choice of form(s) and degree(s) of parallelism depends non-trivially on workload characteristics and user input, our runtime system unloads an important burden from the programmer.

- We show that dynamic multigrain parallelization is a necessary optimization for sustaining maximum performance on asymmetric parallel architectures, since no static parallelization scheme is able to achieve high accelerator efficiency in all cases.

- We present an event-driven multithreading execution engine, which achieves higher efficiency on accelerators by oversubscribing the host core.

- We present a feedback-guided scheduling policy for dynamically triggering and throttling loop-level parallelism across accelerators. We show that work-sharing of divisible tasks across accelerators should be used when the event-driven multithreading engine leaves more than half of the accelerators idle (a minimal sketch of this rule follows the list).

- We observe benefits from loop-level parallelization of off-loaded tasks across accelerators. However, we also observe that loop-level parallelism should be exposed only in conjunction with low-degree task-level parallelism.

- We present kernel-level extensions to our runtime system, which enable efficient process scheduling when the host core is oversubscribed with multiple processes.
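The work-sharing trigger stated above can be summarized by a simple rule. The sketch below is illustrative only; the idle-SPE counter is an assumed input rather than part of the actual MGPS/EDTLP interface.

/* Illustrative sketch of the feedback rule stated above (not the actual
 * scheduler code): expose loop-level parallelism (work-sharing of divisible
 * tasks) only when task-level parallelism alone leaves more than half of
 * the accelerators idle during the last monitoring interval. */
enum parallel_mode { TASK_LEVEL_ONLY, TASK_AND_LOOP_LEVEL };

enum parallel_mode choose_mode(int idle_spes, int total_spes)
{
    if (idle_spes > total_spes / 2)
        return TASK_AND_LOOP_LEVEL;
    return TASK_LEVEL_ONLY;
}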

2.2 Rightsizing Multigrain Parallelism

When executing a multi-level parallel application on asymmetric parallel processors, performance can be strongly affected by the execution configuration. In the case of RAxML execution on the Cell processor, depending on the runtime degree of each level of parallelism in the application, the performance variation can be as high as 40%. To address the issue of determining the optimal parallel configuration, we introduce a new runtime scheduler, S-MGPS, which performs sampling and timing of the dominant phases in the application in order to determine the most efficient mapping of different levels of parallelism to the architecture.

There are several essential differences between S-MGPS and our previously introduced runtime scheduler, MGPS. MGPS is a utilization-driven scheduler, which seeks the highest possible accelerator utilization by exploiting additional layers of parallelism when some accelerator cores appear underutilized. MGPS attempts to increase utilization by creating more accelerator tasks from innermost layers of parallelism, more specifically, as many tasks as the number of idle accelerators recorded during intervals of execution. S-MGPS is a scheduler which seeks the optimal application-system configuration, in terms of the layers of parallelism exposed to the hardware and the degree of granularity per layer of parallelism, based on the runtime task throughput of the application and regardless of system utilization. S-MGPS takes into account the cumulative effects of contention and other system bottlenecks on software parallelism and can converge to the best multi-grain parallel execution algorithm. MGPS, on the other hand, uses only information on SPE utilization and may often converge to a suboptimal multi-grain parallel execution algorithm. A further contribution of S-MGPS is that the scheduler is immune to the initial configuration of parallelism in the application and uses a sampling method which is independent of application-specific parameters or input. On the contrary, the performance of MGPS is sensitive to both the initial structure of parallelism in the application and the input.

Although the scientific codes we use in this thesis implement similar functionality, they differ in their structure and parallelization strategies and raise different challenges for user-level schedulers.

We show that S-MGPS performs within 2% of the optimal scheduling algorithm in PBPI and within 2%-10% of the optimal scheduling algorithm in RAxML. We also show that S-MGPS adapts well to variation of the input size and granularity of parallelism, whereas the performance of MGPS is sensitive to both of these factors.

2.3 MMGP Model

The technique used by the S-MGPS scheduler might not be scalable to large, complex systems, large applications, or applications whose behavior varies significantly with the input. The execution time of a complex application is a function of many parameters. A given parallel application may consist of N phases, where each phase is affected differently by accelerators. Each phase can exploit d dimensions of parallelism or any combination thereof, such as ILP, TLP, or both. Each phase or dimension of parallelism can use any of m different programming and execution models, such as message passing, shared memory, SIMD, or any combination thereof. Accelerator availability or use may consist of c possible configurations, involving different numbers of accelerators. Exhaustive analysis of the execution time for all combinations requires at least N × d × m × c trials with any given input.

Models of parallel computation have been instrumental in the adoption and use of parallel systems. Unfortunately, commonly used models [24, 35] are not directly portable to accelerator-based systems. First, the heterogeneous processing common to these systems is not reflected in most models of parallel computation. Second, current models do not capture the effects of multi-grain parallelism. Third, few models account for the effects of using multiple programming models in the same program. Parallel programming in multiple dimensions and with a synthesis of models consumes both enormous amounts of programming effort and significant amounts of execution time, if not handled with care. To overcome these deficits, we present a model for multi-dimensional parallel computation on asymmetric multi-core processors. Considering that each dimension of parallelism reflects a different degree of computation granularity, we name the model MMGP, for Model of Multi-Grain Parallelism.

MMGP is an analytical model which formalizes the process of programming accelerator-based systems and reduces the need for exhaustive measurements. This thesis presents a generalized MMGP model for accelerator-based architectures with one layer of host processor parallelism and one layer of accelerator parallelism, followed by the specialization of this model for the Cell Broadband Engine. The input to MMGP is an explicitly parallel program, with parallelism expressed through machine-independent abstractions, using common programming libraries and constructs. Upon identification of a few key parameters of the application, derived from micro-benchmarking and profiling of a sequential run, MMGP predicts with reasonable accuracy the execution time of all feasible mappings of the application to host processors and accelerators. MMGP is fast and reasonably accurate, therefore it can be used to quickly identify optimal operating points, in terms of the exposed layers of parallelism and the degree of parallelism in each layer, on accelerator-based systems. Experiments with two complete applications from the field of computational phylogenetics, on a shared-memory multiprocessor with single and multiple nodes that contain the Cell BE, show that MMGP models the parallel execution time of complex parallel codes with multiple layers of task and data parallelism with a mean error in the range of 1%-6%, across all feasible program configurations on the target system. Due to the narrow margin of error, MMGP accurately predicts the optimal mapping of programs to cores for the cases we have studied so far.
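To make concrete the size of the configuration space that MMGP helps avoid exploring, the exhaustive-search bound from Section 2.3 can be instantiated with purely illustrative numbers:

\[
\text{trials} \;\geq\; N \cdot d \cdot m \cdot c
\]

For instance, with $N = 4$ phases, $d = 2$ dimensions of parallelism, $m = 2$ programming models, and $c = 16$ accelerator configurations, an exhaustive search already needs $4 \cdot 2 \cdot 2 \cdot 16 = 256$ timed runs for a single input, whereas MMGP requires only a few micro-benchmarks and one profiled sequential run.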

Chapter 3 Experimental Testbed

This chapter provides details on our experimental testbed, including the two applications that we used to study user-level schedulers on the Cell BE (RAxML and PBPI) and the hardware platform on which we conducted this research. RAxML and PBPI are computational biology applications designed to determine phylogenetic trees. Phylogenetic trees are used to represent the evolutionary history of a set of n organisms. An alignment of the DNA or AA sequences representing those n organisms (also called taxa) can be used as input for the computation of phylogenetic trees. In a phylogeny the organisms of the input data set are located at the tips (leaves) of the tree, whereas the inner nodes represent extinct common ancestors. The branches of the tree represent the time which was required for the mutation of one species into another, new one. The generation of phylogenies with computational methods has many important applications in medical and biological research (see [14] for a summary).

The fundamental algorithmic problem computational phylogeny faces is the immense number of alternative tree topologies, which grows exponentially with the number of organisms n; for example, for n = 50 organisms there already exist about 2.84 * 10^76 alternative trees (for comparison, the number of atoms in the universe is estimated at roughly 10^80). In fact, it has only recently been shown that the phylogeny problem is NP-hard [34]. In addition, generating phylogenies is a very memory- and floating-point-intensive process, such that the application of high performance computing techniques, as well as the assessment of new CPU architectures, can contribute significantly to the reconstruction of larger and more accurate trees.

The computation of the phylogenetic tree containing representatives of all living beings on earth is still one of the grand challenges in Bioinformatics.

3.1 RAxML

RAxML-VI-HPC (v2.1.3) (Randomized Axelerated Maximum Likelihood version VI for High Performance Computing) [87] is a program for large-scale ML-based (Maximum Likelihood [43]) inference of phylogenetic (evolutionary) trees using multiple alignments of DNA or AA (Amino Acid) sequences. The program is freely available as open source code at icwww.epfl.ch/~stamatak. The current version of RAxML incorporates a rapid hill-climbing search algorithm. A recent performance study [87] on real-world datasets with 1,000 sequences reveals that it is able to find better trees in less time and with lower memory consumption than other current ML programs (IQPNNI, PHYML, GARLI). Moreover, RAxML-VI-HPC has been parallelized with MPI (Message Passing Interface), to enable embarrassingly parallel non-parametric bootstrapping and multiple inferences on distinct starting trees in order to search for the best-known ML tree. Like every ML-based program, RAxML exhibits a source of fine-grained loop-level parallelism in the likelihood functions, which consume over 90% of the overall computation time. This source of parallelism scales well on large memory-intensive multi-gene alignments due to increased cache efficiency. The MPI version of RAxML is the basis of our Cell version of the code [20].

In RAxML, multiple inferences on the original alignment are required in order to determine the best-known (best-scoring) ML tree (we use the term best-known because the problem is NP-hard). Furthermore, bootstrap analyses are required to assign confidence values ranging between 0.0 and 1.0 to the internal branches of the best-known ML tree. This allows determining how well-supported certain parts of the tree are and is important for the biological conclusions drawn from it.

All those individual tree searches, be it bootstraps or multiple inferences, are completely independent from each other and can thus be exploited by a simple master-worker MPI scheme. Each search can further exploit data parallelism via thread-level parallelization of loops and/or SIMDization.

3.2 PBPI

PBPI is based on Bayesian phylogenetic inference, which constructs phylogenetic trees from DNA or AA sequences using the Markov Chain Monte Carlo (MCMC) sampling method. The program is freely available as open source code. The MCMC method is inherently sequential, and the state of each time step depends on previous time steps. Therefore, the PBPI application uses the algorithmic improvements described below to achieve highly efficient parallel inference of phylogenetic trees. PBPI exploits multi-grain parallelism to achieve scalability on large-scale distributed memory systems, such as the IBM BlueGene/L [45]. The algorithm of PBPI can be summarized as follows (a sketch of the resulting process grid is shown after the list):

1. Partition the Markov chains into chain groups, and split the data set into segments along the sequences.

2. Organize the virtual processors that execute the code into a two-dimensional grid; map each chain group to a row on the grid and map each segment to a column on the grid.

3. During each generation, compute the partial likelihood across all columns and use all-to-all communication to collect the complete likelihood values on all virtual processors of the same row.

4. When there are multiple chains, randomly choose two chains for swapping using point-to-point communication.
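A minimal sketch of how such a two-dimensional virtual-processor grid can be set up with MPI is shown below. It is an illustration of the decomposition described above, not the actual PBPI code, and it assumes that per-segment (log-)likelihood contributions are combined by summation within a row.

/* Illustrative sketch (not the PBPI implementation): build row and column
 * communicators for a grid of MPI processes with n_cols columns. Chain
 * groups map to rows, alignment segments map to columns. */
#include <mpi.h>

void make_grid(int n_cols, MPI_Comm *row_comm, MPI_Comm *col_comm)
{
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int row = rank / n_cols;   /* which chain group       */
    int col = rank % n_cols;   /* which alignment segment */

    MPI_Comm_split(MPI_COMM_WORLD, row, col, row_comm);
    MPI_Comm_split(MPI_COMM_WORLD, col, row, col_comm);
}

/* Per generation: combine the partial likelihoods computed on each segment
 * into the complete likelihood, replicated on every process of the row. */
double row_likelihood(double partial_loglik, MPI_Comm row_comm)
{
    double complete = 0.0;
    MPI_Allreduce(&partial_loglik, &complete, 1, MPI_DOUBLE, MPI_SUM, row_comm);
    return complete;
}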

From a computational perspective, PBPI differs substantially from RAxML. While RAxML is embarrassingly parallel, PBPI uses a predetermined virtual processor topology and a corresponding data decomposition method. While the degree of task parallelism in RAxML may vary considerably at runtime, PBPI exposes, from the beginning of execution, a high degree of two-dimensional data parallelism to the runtime system. On the other hand, while the degree of task parallelism can be controlled dynamically in RAxML without performance penalty, in PBPI changing the degree of outermost data parallelism requires data redistribution and incurs a high performance penalty.

3.3 Hardware Platform

The Cell BE is a heterogeneous multi-core processor which integrates a simultaneous multithreading PowerPC core (the Power Processing Element, or PPE) and eight specialized accelerator cores (the Synergistic Processing Elements, or SPEs) [40]. These elements are connected in a ring topology on an on-chip network called the Element Interconnect Bus (EIB). The organization of Cell is illustrated in Figure 3.1.

Figure 3.1: Organization of Cell (the PPE and eight SPEs with their local storage, plus the memory controller and I/O controller, connected by the EIB).

The PPE is a 64-bit SMT processor running the PowerPC ISA, with vector/SIMD multimedia extensions [71]. The PPE has two levels of on-chip cache. The L1-I and L1-D caches of the PPE have a capacity of 32 KB each. The L2 cache of the PPE has a capacity of 512 KB.

Each SPE is a 128-bit vector processor with two major components: a Synergistic Processor Unit (SPU) and a Memory Flow Controller (MFC). All instructions are executed on the SPU. The SPU includes 128 registers, each 128 bits wide, and 256 KB of software-controlled local storage. The SPU can fetch instructions and data only from its local storage and can write data only to its local storage. The SPU implements a Cell-specific set of SIMD intrinsics. All single-precision floating point operations on the SPU are fully pipelined, and the SPU can issue one single-precision floating point operation per cycle. Double-precision floating point operations are partially pipelined, and two double-precision floating point operations can be issued every six cycles. Double-precision FP performance is therefore significantly lower than single-precision FP performance. With all eight SPUs active and fully pipelined double-precision FP operation, the Cell BE is capable of a peak performance of roughly 14.6 Gflops. In single-precision FP operation, the Cell BE is capable of a peak performance of over 200 Gflops [33].

The SPE can access RAM through direct memory access (DMA) requests. DMA transfers are handled by the MFC. All programs running on an SPE use the MFC to move data and instructions between local storage and main memory. Data transferred between local storage and main memory must be 128-bit aligned. The size of each DMA transfer can be at most 16 KB. DMA lists can be used for transferring more than 16 KB of data. A list can have up to 2,048 DMA requests, each for up to 16 KB. The MFC supports only DMA transfer sizes that are 1, 2, 4, 8, or multiples of 16 bytes.

The EIB is an on-chip coherent bus that handles communication between the PPE, SPEs, main memory, and I/O devices. Physically, the EIB is a 4-ring structure which can transmit 96 bytes per cycle, for a maximum theoretical bandwidth of 204.8 Gigabytes/second. The EIB can support more than 100 outstanding DMA requests.

In this work we use a Cell blade (IBM BladeCenter QS20) with two Cell BEs running at 3.2 GHz and 1 GB of XDR RAM (512 MB per processor). The PPEs run Linux Fedora Core 6. We use IBM SDK 2.1 and LAM/MPI.
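The following sketch illustrates how an SPE program typically respects these DMA constraints while streaming a large array through its local storage, overlapping the transfer of the next chunk with computation on the current one (the double-buffering technique applied in Chapter 4). It is a simplified illustration: process() and the chunk size are assumptions, the total size is assumed to divide evenly into chunks, and only the MFC intrinsics from spu_mfcio.h are real interfaces.

/* SPE-side sketch of double-buffered DMA input (illustrative only). */
#include <spu_mfcio.h>

#define CHUNK 2048   /* per-transfer size: below the 16 KB limit, multiple of 16 */

static char buf[2][CHUNK] __attribute__((aligned(128)));   /* satisfies alignment */

extern void process(char *data, unsigned int nbytes);      /* hypothetical kernel */

void stream_in(unsigned long long ea, unsigned int total)  /* total % CHUNK == 0  */
{
    unsigned int off, cur = 0;

    mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);                /* prefetch first chunk */
    for (off = 0; off < total; off += CHUNK) {
        unsigned int nxt = cur ^ 1;
        if (off + CHUNK < total)                            /* start next transfer  */
            mfc_get(buf[nxt], ea + off + CHUNK, CHUNK, nxt, 0, 0);

        mfc_write_tag_mask(1 << cur);                       /* wait for current DMA */
        mfc_read_tag_status_all();

        process(buf[cur], CHUNK);                           /* compute while the     */
        cur = nxt;                                          /* next DMA is in flight */
    }
}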


Chapter 4 Code Optimization Methodologies for Asymmetric Multi-core Systems with Explicitly Managed Memories

Accelerator-based architectures with explicitly managed memories have the advantage of achieving a high degree of communication-computation overlap. While this is a highly desirable goal in high-performance computing, it is also a significant drawback from the programmability perspective. Managing all memory accesses from the application level significantly increases the complexity of the written code. In our work, we investigate execution models that reduce the complexity of the code written for asymmetric architectures, but still achieve desirable performance and high utilization of the available architectural resources. We investigate a set of optimizations that have the most significant impact on the performance of scientific applications executed on asymmetric architectures. In our case study, we investigate the optimization process which enables efficient execution of RAxML and PBPI on the Cell architecture. The results presented in this chapter indicate that RAxML and PBPI are highly optimized for Cell, and they also motivate the discussion presented in the rest of the thesis. The Cell-specific optimizations applied to the two bioinformatics applications resulted in a speedup of more than two times. At the same time, we show that even when extensively optimized for sequential execution, parallel applications demand sophisticated scheduling support for efficient parallel execution on heterogeneous multi-core platforms.

4.1 Porting and Optimizing RAxML on Cell

We ported RAxML to Cell in four steps:

1. We ported the MPI code to the PPE.

2. We offloaded the most time-consuming parts of each MPI process to the SPEs.

3. We optimized the SPE code using vectorization of floating point computation, vectorization of control statements coupled with a specialized casting transformation, overlapping of computation and communication (double buffering), and other communication optimizations.

4. Lastly, we implemented multi-level parallelization schemes across and within SPEs in selected cases, as well as a scheduler for effective simultaneous exploitation of task, loop, and SIMD parallelism.

We outline optimizations 1-3 in the rest of the chapter. We focus on multi-level parallelization, as well as different scheduling policies, in Chapter 5.

Function Off-loading

We profiled the application using gprof to identify the computationally intensive functions that could be candidates for offloading and optimization on SPEs. We used an IBM Power5 processor for profiling RAxML. For the profiling and benchmarking runs of RAxML presented in this chapter, we used the input file 42 SC, which contains 42 organisms, each represented by a DNA sequence of 1167 nucleotides. The number of distinct data patterns in the DNA alignment is on the order of 250. On the IBM Power5, 98.77% of the total execution time is spent in three functions:

- 77.24% in newview(), which computes the partial likelihood vector [44] at an inner node of the phylogenetic tree,

- 19.16% in makenewz(), which optimizes the length of a given branch with respect to the tree likelihood using the Newton-Raphson method,

- 2.37% in evaluate(), which calculates the log likelihood score of the tree at a given branch by summing over the partial likelihood vector entries.

These functions are the best candidates for offloading on SPEs. The prerequisite for computing evaluate() and makenewz() is that the likelihood vectors at the nodes of the phylogenetic tree to the right and left of the current branch have been computed. Thus, makenewz() and evaluate() initially make calls to newview(), before they can execute their own computation. The newview() function at an inner node p of a tree calls itself recursively when the two children r and q are not tips (leaves) and the likelihood arrays for r and q have not already been computed. Consequently, the first candidate for offloading is newview(). Although makenewz() and evaluate() both take a smaller portion of the execution time than newview(), offloading these two functions results in significant speedup (see Section 4.2.6). Besides the fact that each function can be executed faster on an SPE, having all three functions offloaded to an SPE significantly reduces the amount of PPE-SPE communication.

In order to have a function executed on an SPE, we spawn an SPE thread at the beginning of each MPI process. The thread executes the offloaded function upon receiving a signal from the PPE and returns the result back to the PPE upon completion. To avoid excessive overhead from repeated thread spawning and joining, threads remain bound on the SPEs and busy-wait for the PPE signal before starting to execute a function.

Optimizing Off-Loaded Functions

The discussion in this section refers to function newview(), which is the most computationally expensive function in the code. Table 4.1 summarizes the execution times of RAxML before and after newview() is offloaded. The first column shows the number of workers (MPI processes) used in the experiment and the amount of work (bootstraps) performed.

(a) Whole application executed on the PPE:
    1 worker, 1 bootstrap      24.4s
    2 workers, 8 bootstraps   134.1s
    2 workers, 16 bootstraps  267.7s
    2 workers, 32 bootstraps  539s

(b) newview() offloaded to one SPE:
    1 worker, 1 bootstrap      45s
    2 workers, 8 bootstraps   201.9s
    2 workers, 16 bootstraps  401.7s
    2 workers, 32 bootstraps  805s

Table 4.1: Execution time of RAxML (in seconds). The input file is 42 SC. (a) The whole application is executed on the PPE; (b) newview() is offloaded to one SPE.

The maximum number of workers we use is 2, since more workers would conflict on the PPE, which is a 2-way SMT processor. Executing a small number of workers results in low SPE utilization (each worker uses one SPE). In Section 4.3, we present results when the PPE is oversubscribed with up to 8 worker processes.

As shown in Table 4.1, merely offloading newview() causes performance degradation. We profiled the new version of the code in order to get a better understanding of the major bottlenecks. Inside newview(), we identified three parts where the function spends almost its entire lifetime: the first part includes a large if(...) statement with a conjunction of four arithmetic comparisons used to check if small likelihood vector entries need to be scaled to avoid numerical underflow (similar checks are used in every ML implementation); the second time-consuming part involves DMA transfers; the third includes the loops that perform the actual likelihood vector calculation. In the next few sections we describe the techniques used to optimize the aforementioned parts of newview(). The same techniques were applied to the other offloaded functions.

Vectorizing Conditional Statements

RAxML always invokes newview() at an inner node of the tree (p) which is at the root of a subtree. The main computational kernel in newview() has a switch statement which selects one out of four paths of execution. If one or both descendants (r and q) of p are tips (leaves), the computations of the main loop in newview() can be simplified.

This optimization leads to significant performance improvements [87]. To activate the optimization, we use four implementations of the main computational part of newview(), for the cases that r and q are tips, r is a tip, q is a tip, or r and q are both inner nodes. Each of the four execution paths in newview() leads to a distinct, highly optimized version of the loop which performs the actual likelihood vector calculations. Each iteration of this loop executes the previously mentioned if() statement (Section 4.2.1) to check for likelihood scaling. Mis-predicted branches in the compiled code for this statement incur a penalty of approximately 20 cycles [92]. We profiled newview() and found that 45% of the execution time is spent in this particular conditional statement. Furthermore, almost all of the time is spent in checking the condition, while negligible time is spent in the body of code in the fall-through part of the conditional statement. The problematic conditional statement is shown below. The symbol ml is a constant and all operands are double precision floating point numbers.

if (ABS(x3->a) < ml && ABS(x3->g) < ml &&
    ABS(x3->c) < ml && ABS(x3->t) < ml) {
   ...
}

This statement is a challenge for a branch predictor, since it implies 8 conditions, one for each of the four ABS() macros and the four comparisons against the minimum likelihood value constant (ml). On an SPE, comparing integers can be significantly faster than comparing doubles, since integer values can be compared using the SPE intrinsics. Although the current SPE intrinsics support only comparison of 32-bit integer values, the comparison of 64-bit integers is also possible by combining different intrinsics that operate on 32-bit integers. The current spu-gcc compiler automatically optimizes an integer branch using the SPE intrinsics.

To optimize the problematic branches, we made the observation that integer comparison is faster than floating point comparison on an SPE. According to the IEEE standard, numbers represented in float and double formats are lexicographically ordered [61]; i.e., if two floating point numbers in the same format are ordered, then they are ordered the same way when their bits are reinterpreted as sign-magnitude integers [61]. In other words, instead of comparing two floating point numbers we can interpret their bit patterns as integers and do an integer comparison. The final outcome of comparing the integer interpretations of two doubles (floats) will be the same as comparing their floating point values, as long as both numbers are non-negative. In our case, all operands are positive, so instead of a floating point comparison we can perform an integer comparison. To get the absolute value of a floating point number, we used the spu_and() logic intrinsic, which performs a vector bit-wise AND operation. With spu_and() we always set the left-most (sign) bit of a floating point number to zero. If the number is already positive, nothing changes, since the sign bit is already zero. In this way, we avoid using ABS(), which uses a conditional statement to check whether the operand is greater than or less than 0. After getting the absolute values of all the operands involved in the problematic if() statement, we cast each operand to an unsigned long long value and perform the comparison.
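This ordering argument can be checked with a small, self-contained program (independent of the SPE code; the sample values are arbitrary):

/* Demonstration of the property used above: for non-negative IEEE-754
 * doubles, comparing the raw bit patterns as unsigned 64-bit integers gives
 * the same result as comparing the floating point values themselves. */
#include <stdio.h>
#include <string.h>

static unsigned long long bits_of(double d)
{
    unsigned long long u;
    memcpy(&u, &d, sizeof u);   /* reinterpret the bit pattern */
    return u;
}

int main(void)
{
    double a = 1.0e-300, b = 3.5e-7;
    printf("float compare: %d, integer compare: %d\n",
           a < b, bits_of(a) < bits_of(b));   /* both print 1 */
    return 0;
}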

The optimized conditional statement is shown below.

unsigned long long a[4];

a[0] = *(unsigned long long*)&x3->a & 0x7fffffffffffffffULL;
a[1] = *(unsigned long long*)&x3->c & 0x7fffffffffffffffULL;
a[2] = *(unsigned long long*)&x3->g & 0x7fffffffffffffffULL;
a[3] = *(unsigned long long*)&x3->t & 0x7fffffffffffffffULL;

if (*(unsigned long long*)&a[0] < minli &&
    *(unsigned long long*)&a[1] < minli &&
    *(unsigned long long*)&a[2] < minli &&
    *(unsigned long long*)&a[3] < minli) {
   ...
}

Following optimization of the offending conditional statement, its contribution to execution time in newview() comes down to 6%, as opposed to 45% before optimization. The total execution time (Table 4.2) improves by 25%-27%.

1 worker, 1 bootstrap      32.5s
2 workers, 8 bootstraps   151.7s
2 workers, 16 bootstraps  302.7s
2 workers, 32 bootstraps  604s

Table 4.2: Execution time of RAxML after the floating-point conditional statement is transformed to an integer conditional statement and vectorized. The input file is 42 SC.

Double Buffering and Memory Management

Depending on the size of the input alignment, the major calculation loop (the loop that performs the calculation of the likelihood vector) in newview() can execute up to 50,000 iterations. The number of iterations is directly related to the alignment length. The loop operates on large arrays, and each member of the arrays is an instance of the likelihood vector structure shown in Figure 4.1.

typedef struct likelihood_vector {
    double a, c, g, t;
    int exp;
} likelivector __attribute__ ((aligned(128)));

Figure 4.1: The likelihood vector structure is used in almost all memory traffic between main memory and the local storage of the SPEs. The structure is 128-bit aligned, as required by the Cell architecture.

The arrays are allocated dynamically at runtime. Since there is no limit on the size of these arrays, we are unable to keep all the members of the arrays in the local storage of the SPEs.

40 1 worker, 1 bootstrap 31.1s 2 workers, 8 bootstraps 145.4s 2 workers, 16 bootstraps 290s 2 workers, 32 bootstraps 582.6s Table 4.3: Execution time of RAxML with double buffering applied to overlap DMA transfers with computation. The input file is 42 SC. SPEs. Instead, we strip-mine the arrays, by fetching a few array elements to local storage at a time, and execute the corresponding loop iterations on a batch of elements at a time. We use a 2 KByte buffer for caching likelihood vectors, which is enough to store the data needed for 16 loop iterations. It should be noted that the space used for buffers is much smaller than the size of the local storage. In the original code where SPEs wait for all DMA transfers, the idle time accounts for 11.4% of execution time of newview(). We eliminated the waiting time by using double buffering to overlap DMA transfers with computation. The total execution time of the application after applying double buffering and tuning the data transfer size (set to 2 KBytes) is shown in Table Vectorization All calculations in newview() are enclosed in two loops. The first loop has a small trip count (typically 4 25 iterations) and computes the individual transition probability matrices (see Section 4.2.1) for each distinct rate category of the CAT or Γ models of rate heterogeneity [86]. Each iteration executes 36 double precision floating point operations. The second loop computes the likelihood vector. Typically, the second loop has a large trip count, which depends on the number of distinct data patterns in the data alignment. For the 42 SC input file, the second loop has 228 iterations and executes 44 double precision floating point operations per iteration. Each SPE on the Cell is capable of exploiting data parallelism via vectorization. The SPE vector registers can store two double precision floating point elements. We vectorized the two loops in 24

The kernel of the first loop in newview() is shown in Figure 4.2(a). Figure 4.2(b) shows the same code vectorized for the SPE.

    (a) Non-vectorized code:

    for( ... )
    {
        ki = *rptr++;

        d1c = exp (ki * lz10);
        d1g = exp (ki * lz11);
        d1t = exp (ki * lz12);

        *left++ = d1c * *EV++;
        *left++ = d1g * *EV++;
        *left++ = d1t * *EV++;
        *left++ = d1c * *EV++;
        *left++ = d1g * *EV++;
        *left++ = d1t * *EV++;
        ...
    }

    (b) Vectorized code:

    1: vector double *left_v = (vector double*)left;
    2: vector double lz1011 = (vector double)(lz10,lz11);
       ...
       for( ... )
       {
    3:     ki_v = spu_splats(*rptr++);
    4:     d1cg = _exp_v( spu_mul(ki_v,lz1011) );
           d1tc = _exp_v( spu_mul(ki_v,lz1210) );
           d1gt = _exp_v( spu_mul(ki_v,lz1112) );

           left_v[0] = spu_mul(d1cg,ev_v[0]);
           left_v[1] = spu_mul(d1tc,ev_v[1]);
           left_v[2] = spu_mul(d1gt,ev_v[2]);
           ...
       }

    Figure 4.2: The body of the first loop in newview(): (a) non-vectorized code, (b) vectorized code.

For better understanding of the vectorized code we briefly describe the SPE vector instructions we used: the instruction labeled 1 creates a vector pointer to an array consisting of double elements; the instruction labeled 2 joins two double elements, lz10 and lz11, into a single vector element; the instruction labeled 3 creates a vector from a single double element; the instruction labeled 4 is a composition of 2 different vector instructions:

1. spu_mul() multiplies two vectors (in this case the arguments are vectors of doubles).
2. _exp_v() is the vector version of the exponential instruction.

After vectorization, the number of the floating point instructions executed in the body of the first loop is 24. Also, there is one additional instruction for creating a vector from a scalar element. Note that due to involved pointer arithmetic on dynamically allocated data structures, automatic vectorization of this code would be particularly challenging for a compiler.

Figure 4.3 illustrates the second loop (showing a few selected instructions which dominate execution time in the loop).

    Non-vectorized code:

    for( ... )
    {
        ump_x1_0  = x1->a;
        ump_x1_0 += x1->c * *left++;
        ump_x1_0 += x1->g * *left++;
        ump_x1_0 += x1->t * *left++;

        ump_x1_1  = x1->a;
        ump_x1_1 += x1->c * *left++;
        ump_x1_1 += x1->g * *left++;
        ump_x1_1 += x1->t * *left++;
        ...
    }

    Vectorized code:

    for( ... )
    {
        a_v = spu_splats(x1->a);
        c_v = spu_splats(x1->c);
        g_v = spu_splats(x1->g);
        t_v = spu_splats(x1->t);

        l1 = (vector double)(left[0],left[3]);
        l2 = (vector double)(left[1],left[4]);
        l3 = (vector double)(left[2],left[5]);

        ump_v1[0] = spu_madd(c_v,l1,a_v);
        ump_v1[0] = spu_madd(g_v,l2,ump_v1[0]);
        ump_v1[0] = spu_madd(t_v,l3,ump_v1[0]);
        ...
    }

    Figure 4.3: The second loop in newview(). Non-vectorized code shown first, vectorized code shown second. spu_madd() multiplies the first two arguments and adds the result to the third argument. spu_splats() creates a vector by replicating a scalar element.

The variables x1->a, x1->c, x1->g, and x1->t belong to the same C structure (likelihood vector) and occupy contiguous memory locations. Only three of these variables are multiplied by the elements of the array left[ ]. This makes vectorization more difficult, since the code requires vector construction instructions such as spu_splats(). Obviously, there are many different possibilities for vectorizing this code.

The scheme shown in Figure 4.3 is the one that achieved the best performance in our tests. Note that due to involved pointer arithmetic on dynamically allocated data structures, automatic vectorization of this code may be challenging for a compiler. After vectorization, the number of floating point instructions in the body of the loops drops from 36 to 24 for the first loop, and from 44 to 22 for the second loop. Vectorization adds 25 instructions for creating vectors. Without vectorization, newview() spends 69.4% of its execution time in the two loops. Following vectorization, the time spent in the loops drops to 57% of the execution time of newview(). Table 4.4 shows execution times following vectorization.

    1 worker, 1 bootstrap
    2 workers, 8 bootstraps     132.3s
    2 workers, 16 bootstraps    265.2s
    2 workers, 32 bootstraps    527s

    Table 4.4: Execution time of RAxML following vectorization. The input file is 42 SC.

PPE-SPE Communication

Although newview() accounts for most of the execution time, its granularity is fine and its contribution to execution time is attributed to the large number of invocations. For the 42 SC input, newview() is invoked 230,500 times and the average execution time per invocation is 71µs. In order to invoke an offloaded function, the PPE needs to send a signal to an SPE. Also, after an offloaded function completes, it sends the result back to the PPE. In an early implementation of RAxML, we used mailboxes to implement the communication between the PPE and SPEs. We observed that PPE-SPE communication can be significantly improved if it is performed through main memory and SPE local storage instead of mailboxes. Using memory-to-memory communication improves execution time by 5% to 6.4%. Table 4.5 shows RAxML execution times, including all optimizations discussed so far and direct memory-to-memory communication, for the 42 SC input.

    1 worker, 1 bootstrap        26.4s
    2 workers, 8 bootstraps     123.3s
    2 workers, 16 bootstraps    246.8s
    2 workers, 32 bootstraps    493.3s

    Table 4.5: Execution time of RAxML following the optimization of communication to use direct memory-to-memory transfers. The input file is 42 SC.

It is interesting to note that direct memory-to-memory communication is an optimization which scales with parallelism on Cell, i.e., its performance impact grows as the code uses more SPEs. As the number of workers and bootstraps executed on the SPEs increases, the code becomes more communication-intensive, due to the fine granularity of the offloaded functions.

Increasing the Coverage of Offloading

In addition to newview(), we offloaded makenewz() and evaluate(). All three offloaded functions were packaged in a single code module loaded on the SPEs. The advantage of using a single module is that it can be loaded to the local storage once, when an SPE thread is created, and remain pinned in local storage for the rest of the execution. Therefore, the cost of loading the code on the SPEs is amortized and communication between the PPE and SPEs is reduced. For example, when newview() is called by makenewz() or evaluate(), there is no need for any PPE-SPE communication, since all functions already reside in SPE local storage. Offloading all three critical functions improves performance by a further 25% to 31%. A more important implication is that after offloading and optimization of all three functions, the RAxML code split between the PPE and one SPE becomes actually faster than the sequential code executed exclusively on the PPE, by as much as 19%. Function offloading is another optimization which scales with parallelism. When more than one MPI process is used and more than one bootstrap is offloaded to the SPEs by each process, the gains from offloading rise to 36%. Table 4.6 illustrates execution times after full function offloading.

    1 worker, 1 bootstrap        19.8s
    2 workers, 8 bootstraps      86.8s
    2 workers, 16 bootstraps    173s
    2 workers, 32 bootstraps    344.4s

    Table 4.6: Execution time of RAxML after offloading and optimizing three functions: newview(), makenewz() and evaluate(). The input file is 42 SC.
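To make the structure of the single offloaded module more concrete, the sketch below shows one way an SPE-resident dispatcher could serve all three functions from local storage. The command encoding, the volatile command word, and the run_*() stubs are illustrative assumptions made for this example; they are not the actual RAxML implementation, which signals through mailboxes or direct memory-to-memory transfers as described above.

    /* Sketch: one SPE-resident module serving all three offloaded functions. */
    enum command { CMD_NONE = 0, CMD_NEWVIEW, CMD_MAKENEWZ, CMD_EVALUATE, CMD_EXIT };

    /* Written asynchronously by the PPE (e.g., via a DMA put or a mailbox
       read on the SPE side), hence volatile. */
    static volatile enum command cmd = CMD_NONE;

    /* Stubs standing in for the optimized offloaded kernels. Because all three
       live in the same module, makenewz() and evaluate() can call newview()
       as an ordinary local function, with no extra PPE-SPE round trip. */
    static void run_newview(void)  { /* likelihood-vector computation */ }
    static void run_makenewz(void) { run_newview(); /* branch-length optimization */ }
    static void run_evaluate(void) { run_newview(); /* log-likelihood evaluation  */ }

    int main(void)
    {
        for (;;) {
            while (cmd == CMD_NONE)   /* wait for the next off-load request */
                ;
            enum command c = cmd;
            cmd = CMD_NONE;           /* mark this SPE as idle again */

            if (c == CMD_EXIT)
                break;
            if (c == CMD_NEWVIEW)
                run_newview();
            else if (c == CMD_MAKENEWZ)
                run_makenewz();
            else if (c == CMD_EVALUATE)
                run_evaluate();
            /* In the real code, results and a completion flag would be sent
               back to the PPE here (direct memory-to-memory transfer). */
        }
        return 0;
    }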

    Figure 4.4: Performance of (a) RAxML and (b) PBPI with different number of MPI processes.

4.3 Parallel Execution

After improving the performance of RAxML and PBPI using the presented optimization techniques, we investigated parallel execution of both applications on the Cell processor. To achieve higher utilization of the Cell chip, we oversubscribed the PPE with different numbers of MPI processes (2 to 8) and assigned a single SPE to each MPI process. The execution times of the different parallel configurations are presented in Figure 4.4. In the presented experiments we use strong scaling, i.e., the total amount of computation remains fixed as the number of processors grows. In Figure 4.4(a) we observe that for any number of processes larger than two, the execution time of RAxML remains constant. Two factors are responsible for the observed behavior:

1. On-chip contention, as well as bus and memory contention, which occurs on the PPE side when the PPE is oversubscribed by multiple processes;

2. The Linux kernel is oblivious to the off-loading process, which results in poor scheduling decisions. Each process following the off-loading execution model constantly alternates its execution between the PPE and an SPE. Unaware of this alternation, the OS allows processes to keep control over resources which are not actually used. In other words, the PPE might be assigned to a process which has currently switched to SPE execution.

In the case of PBPI (Figure 4.4(b)), we observe similar behavior to RAxML. From the presented experiments it is clear that naive parallelization of the applications, where the PPE is simply oversubscribed with multiple processes, does not provide satisfactory performance. The poor scaling of the applications is a strong motivation for a detailed exploration of different parallel programming models, as well as scheduling policies, for asymmetric processors. We continue the discussion of parallel execution on heterogeneous architectures in Chapter 5.

Chapter Summary

In this chapter we presented a set of optimizations which enable efficient sequential execution of scientific applications on asymmetric platforms. We exploited the fact that our test applications contain large computational functions (loops) which consume the majority of the execution time. Nevertheless, this assumption does not reduce the generality of the presented techniques, since large, time-consuming computational loops are common in most scientific codes. We explored a total of five optimizations and their performance implications:

I) Offloading the bulk of the maximum likelihood tree calculation to the accelerators;
II) Casting and vectorization of expensive conditional statements involving multiple, hard-to-predict conditions;
III) Double buffering for overlapping memory communication with computation;
IV) Vectorization of the core of the floating point computation;
V) Optimization of communication between the host core and accelerators using direct memory-to-memory transfers.

In our case study, starting from an optimized version of RAxML and PBPI for conventional uniprocessors and multiprocessors, we were able to boost performance on the Cell processor by more than a factor of two.


49 Chapter 5 Scheduling Multigrain Parallelism on Asymmetric Systems 5.1 Introduction In this chapter, we investigate runtime scheduling policies for mapping different layers of parallelism, exposed by an application, to the Cell processor. We assume that applications describe all available algorithmic parallelism to the runtime system explicitly, while the runtime system dynamically selects the degree of granularity and the dimensions of parallelism to expose to the hardware at runtime, using dynamic scheduling mechanisms and policies. In other words, the runtime system is responsible for partitioning algorithmic parallelism in layers that best match the diverse capabilities of the processor cores, while at the same time rightsizing the granularity of parallelism in each layer. 5.2 Scheduling Multi-Grain Parallelism on Cell We hereby explore the possibilities for exploiting multi-grain parallelism on Cell. The Cell PPE can execute two threads or processes simultaneously, from where parts of code can be offloaded and executed on SPEs. To increase the sources of parallelism for SPEs, the user may consider two approaches: The user may oversubscribe the PPE with more processes or threads, than the number of 33

50 processes/threads that the PPE can execute simultaneously. In other words, the programmer attempts to find more parallelism to off-load to accelerators, by attempting a more fine-grain task decomposition of the code. In this case, the runtime system needs to schedule the host processes/threads so as to minimize the idle time on the host core while the computation is off-loaded to accelerators. We present an event-driven task-level scheduler (EDTLP) which achieves this goal in Section The user can introduce a new dimension of parallelism to the application by distributing loops from within the off-loaded functions across multiple SPEs. In other words, the user can exploit data parallelism both within and across accelerators. Each SPE can work on a part of a distributed loop, which can be further accelerated with SIMDization. We present case studies that motivate the dynamic extraction of multi-grain parallelism via loop distribution in Section Event-Driven Task Scheduling EDTLP is a runtime scheduling module which can be embedded transparently in MPI codes. The EDTLP scheduler operates under the assumption that the code to off-load to accelerators is specified by the user at the level of functions. In the case of Cell, this means that the user has either constructed SPE threads in a separate code module, or annotated the host PPE code with directives to extract SPE threads via a compiler [17]. The EDTLP scheduler avoids underutilization of SPEs by oversubscribing the PPE and preventing a single MPI process from monopolizing the PPE. Informally, the EDTLP scheduler off-loads tasks from MPI processes. A task ready for offloading serves as an event trigger for the scheduler. Upon the event occurrence, the scheduler immediately attempts to serve the MPI process that carries the task to off-load and sends the task to an available SPE, if any. While off-loading a task, the scheduler suspends the MPI process that spawned the task and switches to another MPI process, anticipating that more tasks 34

51 will be available for off-loading from ready-to-run MPI processes. Switching upon off-loading prevents MPI processes from blocking the PPE while waiting for their tasks to return. The scheduler attempts to sustain a high supply of tasks for off-loading to SPEs by serving MPI processes round-robin. The downside of a scheduler based on oversubscribing a processor is context-switching overhead. Cell in particular also suffers from the problem of interference between processes or threads sharing the SMT PPE core. The granularity of the off-loaded code determines if the overhead introduced by oversubscribing the PPE can be tolerated. The code off-loaded to an SPE should be coarse enough to marginalize the overhead of context switching performed on the PPE. The EDTLP scheduler addresses this issue by performing granularity control of the off-loaded tasks and preventing off-loading of code that does not meet a minimum granularity threshold. Figure 5.1 illustrates an example of the difference between scheduling MPI processes with the EDTLP scheduler and the native Linux scheduler. In this example, each MPI process has one task to off-load to SPEs. For illustrative purposes only, we assume that there are only 4 SPEs on the chip. In Figure 5.1(a), once a task is sent to an SPE, the scheduler forces a context switch on the PPE. Since the PPE is a two-way SMT, two MPI processes can simultaneously off-load tasks to two SPEs. The EDTLP scheduler enables the use of four SPEs via function offloading. On the contrary, if the scheduler waits for the completion of a task before providing an opportunity to another MPI process to off-load (Figure 5.1 (b)), the application can only utilize two SPEs. Realistic application tasks often have significantly shorter lengths than the time quanta used by the Linux scheduler. For example, in RAxML, task lengths measure in the order of tens of microseconds, when Linux time quanta measure to tens of milliseconds. Table 5.1(a) compares the performance of the EDTLP scheduler to that of the native Linux scheduler, using RAxML and running a workload comprising 42 organisms. In this experiment, the number of performed bootstraps is not constant and it is equal to the number of MPI processes. The EDTLP scheduler outperforms the Linux scheduler by up to a factor of 2.7. In the 35

52 (a) (b) Figure 5.1: Scheduler behavior for two off-loaded tasks, representative of RAxML. Case (a) illustrates the behavior of the EDTLP scheduler. Case (b) illustrates the behavior of the Linux scheduler with the same workload. The numbers correspond to MPI processes. The shaded slots indicate context switching. The example assumes a Cell-like system with four SPEs. experiment with PBPI, Table 5.1(b), we execute the code with one Markov chain for 20,000 generations and we change the number of MPI processes used across runs. PBPI is also executed with weak scaling, i.e. we increase the size of the DNA alignment with the number of processes. The workload for PBPI includes 107 organisms. EDTLP outperforms the Linux scheduler policy in PBPI by up to a factor of Scheduling Loop-Level Parallelism The EDTLP model described in Section 5.2 is effective if the PPE has enough coarse-grained functions to off-load to SPEs. In cases where the degree of available task parallelism is less than the number of SPEs, the runtime system can activate a second layer of parallelism, by splitting an already off-loaded task across multiple SPEs. We implemented runtime support for parallelization of for-loops enclosed within off-loaded SPE functions. We parallelize loops in off-loaded functions using work-sharing constructs similar to those found in OpenMP. In RAxML, all for-loops in the three off-loaded functions have no loop-carried dependencies, and obtain speedup from parallelization, assuming that there are enough idle SPEs dedicated to their execution. The number of SPEs activated for work-sharing is user- or system-controlled, as in 36

53 EDTLP Linux 1 worker, 1 bootstrap 19.7s 19.7s 2 workers, 2 bootstraps 22.2s 30s 3 workers, 3 bootstraps 26s 40.7s 4 workers, 4 bootstraps 28.1s 43.3s 5 workers, 5 bootstraps 33s 60.7s 6 workers, 6 bootstraps 34s 61.8s 7 workers, 7 bootstraps 38.8s 81.2s 8 workers, 8 bootstraps 39.8s 81.7s (a) EDTLP Linux 1 worker, 20,000 gen s 27.54s 2 workers, 20,000 gen. 30.2s 30s 3 workers, 20,000 gen s 56.16s 4 workers, 20,000 gen. 36.4s 63.7s 5 workers, 20,000 gen s 93.71s 6 workers, 20,000 gen s 93s 7 workers, 20,000 gen s s 8 workers, 20,000 gen s s (b) Table 5.1: Performance comparison for (a) RAxML and (b) PBPI with two schedulers. The second column shows execution time with the EDTLP scheduler. The third column shows execution time with the native Linux kernel scheduler. The workload for RAxML contains 42 organisms. The workload for PBPI contains 107 organisms. OpenMP. We discuss dynamic system-level control of loop parallelism further in Section 5.3. The parallelization scheme is outlined in Figure 5.2. The program is executed on the PPE until the execution reaches the parallel loop to be off-loaded. At that point the PPE sends a signal to a single SPE which is designated as the master. The signal is processed by the master and further broadcasted to all workers involved in parallelization. Upon a signal reception, each SPE worker fetches the data necessary for loop execution. We ensure that SPEs work on different parts of the loop and do not overlap by assigning a unique identifier to each SPE thread involved in parallelization of the loop. Global data, changed by any of the SPEs during 37

54 loop execution, is committed to main memory at the end of each iteration. After processing the assigned parts of the loop, the SPE workers send a notification back to the master. If the loop includes a reduction, the master collects also partial results from the SPEs and accumulates them locally. All communication between SPEs is performed on chip in order to avoid the long latency of communicating through shared memory. Note that in our loop parallelization scheme on Cell, all work performed by the master SPE can also be performed by the PPE. In this case, the PPE would broadcast a signal to all SPE threads involved in loop parallelization and the partial results calculated by SPEs would be accumulated back at the PPE. Such collective operations increase the frequency of SPE-PPE communication, especially when the distributed loop is a nested loop. In the case of RAxML, in order to reduce SPE-PPE communication and avoid unnecessary invocation of the MPI process that spawned the parallelized loop, we opted to use an SPE to distribute loops to other SPEs and collect the results from other SPEs. In PBPI, we let the PPE execute the master thread during loop parallelization, since loops are coarse enough to overshadow the loop execution overhead. Optimizing and selecting between these loop execution schemes is a subject of ongoing research. SPE threads participating in loop parallelization are created once upon off-loading the code for the first parallel loop to SPEs. The threads remain active and pinned to the same SPEs during the entire program execution, unless the scheduler decides to change the parallelization strategy and redistribute the SPEs between one or more concurrently executing parallel loops. Pinned SPE threads can run multiple off-loaded loop bodies, as long as the code of these loop bodies fits on the local storage of the SPEs. If the loop parallelization strategy is changed on the fly by the runtime system, a new code module with loop bodies that implement the new parallelization strategy is loaded on the local storage of the SPEs. Table 5.2 illustrates the performance of the basic loop-level parallelization scheme of our runtime system in RAxML. Table 5.2(a) illustrates the execution time of RAxML using one MPI process and performing one bootstrap, on a data set which comprises 42 organisms. This 38

experiment isolates the impact of our loop-level parallelization mechanisms on Cell.

    Figure 5.2: Parallelizing a loop across SPEs using a work-sharing model with an SPE designated as the master. The master executes iterations 1 to x/8 and sends start signals to Workers 1 through 7; each worker executes its share of the x iterations (e.g., Worker 1 executes iterations x/8 to x/4 and Worker 7 executes iterations 7x/8 to x) and sends a stop signal back to the master.

The number of iterations in parallelized loops depends on the size of the input alignment in RAxML. For the given data set, each parallel loop executes 228 iterations. The results shown in Table 5.2(a) suggest that when using loop-level parallelism RAxML sees a reasonable yet limited performance improvement. The highest speedup (1.72) is achieved with 7 SPEs. The reasons for the modest speedup are the non-optimal coverage of loop-level parallelism (more specifically, less than 90% of the original sequential code is covered by parallelized loops), the fine granularity of the loops, and the fact that most loops have reductions, which create bottlenecks on the Cell DMA engine. The performance degradation that occurs when 5 or 6 SPEs are used happens because of specific memory alignment constraints that have to be met on the SPEs. Due to these alignment constraints, it is in certain cases not possible to evenly distribute the data used in the loop body, and therefore the workload of iterations, between SPEs. More specifically, the use of character arrays for the main data set in RAxML

forces array transfers in multiples of 16 array elements. Consequently, loop distribution across processors is done with a minimum chunk size of 16 iterations.

    (a)
    1 worker, 1 boot., no LLP                  19.7s
    1 worker, 1 boot., 2 SPEs used for LLP     14s
    1 worker, 1 boot., 3 SPEs used for LLP     13.36s
    1 worker, 1 boot., 4 SPEs used for LLP     12.8s
    1 worker, 1 boot., 5 SPEs used for LLP     13.8s
    1 worker, 1 boot., 6 SPEs used for LLP     12.47s
    1 worker, 1 boot., 7 SPEs used for LLP     11.4s
    1 worker, 1 boot., 8 SPEs used for LLP     11.44s

    (b)
    1 worker, 1 boot., no LLP                  47.9s
    1 worker, 1 boot., 2 SPEs used for LLP     29.5s
    1 worker, 1 boot., 3 SPEs used for LLP     23.3s
    1 worker, 1 boot., 4 SPEs used for LLP     20.5s
    1 worker, 1 boot., 5 SPEs used for LLP     18.7s
    1 worker, 1 boot., 6 SPEs used for LLP     18.1s
    1 worker, 1 boot., 7 SPEs used for LLP     17.1s
    1 worker, 1 boot., 8 SPEs used for LLP     16.8s

    Table 5.2: Execution time of RAxML when loop-level parallelism (LLP) is exploited in one bootstrap, via work distribution between SPEs. The input file is 42 SC: (a) DNA sequences are represented with 10,000 nucleotides, (b) DNA sequences are represented with 20,000 nucleotides.

Loop-level parallelization in RAxML can achieve higher speedup in a single bootstrap with larger input data sets. Alignments that have a larger number of nucleotides per organism have more loop iterations to distribute across SPEs. To illustrate the behavior of loop-level parallelization with coarser loops, we repeated the previous experiment using a data set where the DNA sequences are represented with 20,000 nucleotides. The results are shown in Table 5.2(b). The performance of the loop-level parallelization scheme always increases with the number of SPEs in this experiment.
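The alignment-constrained work distribution described above (transfers, and hence iteration chunks, in multiples of 16) can be illustrated with the following self-contained sketch; chunk_bounds() is a hypothetical helper written for this example, not code from the RAxML runtime.

    #include <stdio.h>

    #define CHUNK 16   /* minimum schedulable unit, dictated by alignment */

    /* Compute the [begin, end) iteration range assigned to SPE 'id',
       handing out whole 16-iteration chunks and clamping the last range
       to the true iteration count. */
    static void chunk_bounds(int n_iters, int nspes, int id, int *begin, int *end)
    {
        int n_chunks = (n_iters + CHUNK - 1) / CHUNK;   /* round up          */
        int per_spe  = n_chunks / nspes;                /* base share        */
        int extra    = n_chunks % nspes;                /* leftover chunks   */
        int first    = id * per_spe + (id < extra ? id : extra);
        int count    = per_spe + (id < extra ? 1 : 0);

        *begin = first * CHUNK;
        *end   = (first + count) * CHUNK;
        if (*end > n_iters) *end = n_iters;             /* clamp final chunk */
    }

    int main(void)
    {
        /* 228 iterations (as in the 42 SC parallel loops) over 5 SPEs. */
        for (int id = 0; id < 5; id++) {
            int b, e;
            chunk_bounds(228, 5, id, &b, &e);
            printf("SPE %d: iterations %d..%d (%d)\n", id, b, e - 1, e - b);
        }
        return 0;
    }

With 228 iterations and 5 SPEs, this sketch assigns 48 iterations to four of the SPEs and 36 to the fifth, illustrating the kind of imbalance that the 16-iteration granularity can introduce.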

PBPI exhibits clearly better scalability than RAxML with LLP, since the granularity of loops is coarser in PBPI than RAxML. Table 5.3 illustrates the execution times when PBPI is executed with a variable number of SPEs used for LLP. Again, we control the granularity of the off-loaded code by using different data sets: Table 5.3(a) shows execution times for a data set that contains 107 organisms, each represented by a DNA sequence of 3,000 nucleotides. Table 5.3(b) shows execution times for a data set that contains 107 organisms, each represented by a DNA sequence of 10,000 nucleotides. We run PBPI with one Markov chain for 20,000 generations. For the two data sets, PBPI achieves a maximum speedup of 4.6 and 6.1 respectively, after loop-level parallelization.

    (a)
    1 worker, 1,000 gen., no LLP                  27.2s
    1 worker, 1,000 gen., 2 SPEs used for LLP     14.9s
    1 worker, 1,000 gen., 3 SPEs used for LLP     11.3s
    1 worker, 1,000 gen., 4 SPEs used for LLP      8.4s
    1 worker, 1,000 gen., 5 SPEs used for LLP      7.3s
    1 worker, 1,000 gen., 6 SPEs used for LLP      6.8s
    1 worker, 1,000 gen., 7 SPEs used for LLP      6.2s
    1 worker, 1,000 gen., 8 SPEs used for LLP      5.9s

    (b)
    1 worker, 20,000 gen., no LLP                  262s
    1 worker, 20,000 gen., 2 SPEs used           131.3s
    1 worker, 20,000 gen., 3 SPEs used            92.3s
    1 worker, 20,000 gen., 4 SPEs used            70.1s
    1 worker, 20,000 gen., 5 SPEs used            58.1s
    1 worker, 20,000 gen., 6 SPEs used              49s
    1 worker, 20,000 gen., 7 SPEs used              43s
    1 worker, 20,000 gen., 8 SPEs used            39.7s

    Table 5.3: Execution time of PBPI when loop-level parallelism (LLP) is exploited via work distribution between SPEs. The input file is 107 SC: (a) DNA sequences are represented with 1,000 nucleotides, (b) DNA sequences are represented with 10,000 nucleotides.

58 struct Pass{ volatile unsigned int v1_ad; volatile unsigned int v2_ad; //...arguments for loop body volatile unsigned int vn_ad; volatile double res; volatile int sig[2]; } attribute ((aligned(128))); Figure 5.3: The data structure Pass is used for communication among SPEs. The v i ad variables are used to pass input arguments for the loop body from one local storage to another. The variable sig is used as a notification signal that the memory transfer for the shared data updated during the loop is completed. The variable res is used to send results back to the master SPE, and as a dependence resolution mechanism Implementing Loop-Level Parallelism The SPE threads participating in loop work-sharing constructs are created once upon function off-loading. Communication among SPEs participating in work-sharing constructs is implemented using DMA transfers and the communication structure Pass, depicted in Figure 5.3. The Pass structure is private to each thread. The master SPE thread allocates an array of Pass structures. Each member of this array is used for communication with an SPE worker thread. Once the SPE threads are created, they exchange the local addresses of their Pass structures. This address exchange is performed through the PPE. Whenever one thread needs to send a signal to a thread on another SPE, it issues an mfc put() request and sets the destination address to be the address of the Pass structure of the recipient. In Figure 5.4, we illustrate a RAxML loop parallelized with work-sharing among SPE threads. Before executing the loop, the master thread sets the parameters of the Pass structure for each worker SPE and issues one mfc put() request per worker. This is done in send to spe(). Worker i uses the parameters of the received Pass structure and fetches the data needed for the loop execution to its local storage (function fetch data()). After 42

59 finishing the execution of its portion of the loop, a worker sets the res parameter in the local copy of the structure Pass and sends it to the master, using send to master(). The master accumulates the results from all workers and commits the sum to main memory. Immediately after calling send to spe(), the master participates in the execution of the loop. The master tends to have a slight head start over the workers. The workers need to complete several DMA requests before they can start executing the loop, in order to fetch the required data from the master s local storage or shared memory. In fine-grained off-loaded functions such as those encountered in RAxML, load imbalance between the master and the workers is noticeable. To achieve better load balancing, we set the master to execute a slightly larger portion of the loop. A fully automated and adaptive implementation of this purposeful load unbalancing is obtained by timing idle periods in the SPEs across multiple invocations of the same loop. The collected times are used for tuning iteration distribution in each invocation, in order to reduce idle time on SPEs. 5.3 Dynamic Scheduling of Task- and Loop-Level Parallelism Merging task-level and loop-level parallelism on Cell can improve the utilization of accelerators. A non-trivial problem with such a hybrid parallelization scheme is the assignment of accelerators to tasks. The optimal assignment is largely application-specific, task-specific and input-specific. We support this argument using RAxML as an example. The discussion in this section is limited to RAxML, where the degree of outermost parallelism can be changed arbitrarily by varying the number of MPI processes executing bootstraps, with a small impact on performance. PBPI uses a data decomposition approach which depends on the number of processors, therefore dynamically varying the number of MPI processes executing the code at runtime can not be accomplished without data redistribution. 43

    Master SPE:

    struct Pass pass[Num_SPE];

    for(i=0; i < Num_SPE; i++){
        pass[i].sig[0] = 1;
        ...
        send_to_spe(i,&pass[i]);
    }

    /* Parallelized loop */
    for (... ) {
        ...
    }
    tr->likeli = sum;

    for(i=0; i < Num_SPE; i++){
        while(pass[i].sig[1] == 0);
        pass[i].sig[1] = 0;
        tr->likeli += pass[i].res;
    }
    commit(tr->likeli);

    Worker SPE:

    struct Pass pass;

    while(pass.sig[0]==0);
    fetch_data();

    /* Parallelized loop */
    for (... ) {
        ...
    }
    tr->likeli = sum;

    pass.res = sum;
    pass.sig[1] = 1;
    send_to_master(&pass);

    Figure 5.4: Parallelization of the loop from function evaluate() in RAxML. The first listing depicts the code executed by the master SPE, while the second listing depicts the code executed by a worker SPE. Num_SPE represents the number of SPE worker threads.

Application-Specific Hybrid Parallelization on Cell

We present a set of experiments with RAxML performing a number of bootstraps ranging between 1 and 128. In these experiments we use three versions of RAxML. Two of the three versions use hybrid parallelization models combining task- and loop-level parallelism. The third version exploits only task-level parallelism and uses the EDTLP scheduler. More specifically, in the first version, each off-loaded task is parallelized across 2 SPEs, and 4 MPI processes are multiplexed on the PPE, executing 4 concurrent bootstraps. In the second version, each off-loaded task is parallelized across 4 SPEs and 2 MPI processes are multiplexed on the PPE,

executing 2 concurrent bootstraps. In the third version, the code concurrently executes 8 MPI processes, the off-loaded tasks are not parallelized, and the tasks are scheduled with the EDTLP scheduler.

    Figure 5.5: Comparison of task-level and hybrid parallelization schemes in RAxML, on the Cell BE. The input file is 42 SC. The number of ML trees created is (a) 1-16, (b)

Figure 5.5 illustrates the results of the experiments, with a data set representing 42 organisms. The x-axis shows the number of bootstraps, while the y-axis shows execution time in seconds. As expected, the hybrid model outperforms EDTLP when up to 4 bootstraps are executed, since only a combination of EDTLP and LLP can off-load code to more than 4 SPEs simul-

62 taneously. With 5 to 8 bootstraps, the hybrid models execute bootstraps in batches of 2 and 4 respectively, while the EDTLP model executes all bootstraps in parallel. EDTLP activates 5 to 8 SPEs solely for task-level parallelism, leaving room for loop-level parallelism on at most 3 SPEs. This proves to be unnecessary, since the parallel execution time is determined by the length of the non-parallelized off-loaded tasks that remain on at least one SPE. In the range between 9 and 12 bootstraps, combining EDTLP and LLP selectively, so that the first 8 bootstraps execute with EDTLP and the last 4 bootstraps execute with the hybrid scheme is the best option. For the input data set with 42 organisms, performance of EDTLP and hybrid EDTLP- LLP schemes is almost identical when the number of bootstraps is between 13 and 16. When the number of bootstraps is higher than 16, EDTLP clearly outperforms any hybrid scheme (Figure 5.5(b)). The reader may notice that the problem of hybrid parallelization is trivialized when the problem size is scaled beyond a certain point, which is 28 bootstraps in the case of RAxML (see Section 5.3.2). A production run of RAxML for real-world phylogenetic analysis would require up to 1,000 bootstraps, thus rendering hybrid parallelization seemingly unnecessary. However, if a production RAxML run with 1,000 bootstraps were to be executed across multiple Cell BEs, and assuming equal division of bootstraps between the processors, the cut-off point for EDTLP outperforming the hybrid EDTLP-LLP scheme would be set at 36 Cell processors. Beyond this scale, performance per processor would be maximized only if LLP were employed in conjunction with EDTLP on each Cell. Although this observation is empirical and somewhat simplifying, it is further supported by the argument that scaling across multiple processors will in all likelihood increase communication overhead and therefore favor a parallelization scheme with less MPI processes. The hybrid scheme reduces the volume of MPI processes compared to the pure EDTLP scheme, when the granularity of work per Cell becomes fine. 46
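To make the cut-off arithmetic explicit (under the same simplifying assumption of an even division of bootstraps across processors):

    1,000 bootstraps / 35 processors ≈ 28.6 bootstraps per Cell  (at or above the 28-bootstrap threshold)
    1,000 bootstraps / 36 processors ≈ 27.8 bootstraps per Cell  (below the 28-bootstrap threshold)

so at roughly 36 Cell processors each processor drops below the per-Cell workload at which pure EDTLP overtakes the hybrid scheme, and the hybrid EDTLP-LLP scheme becomes preferable on every node.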

63 5.3.2 MGPS The purpose of MGPS is to dynamically adapt the parallel execution by either exposing only one layer of task parallelism to the SPEs via event-driven scheduling, or expanding to the second layer of data parallelism and merging it with task parallelism when SPEs are underutilized at runtime. MGPS extends the EDTLP scheduler with an adaptive processor-saving policy. The scheduler runs locally in each process and it is driven by two events: arrivals, which correspond to off-loading functions from PPE processes to SPE threads; departures, which correspond to completion of SPE functions. MGPS is invoked upon arrivals and departures of tasks. Initially, upon arrivals, the scheduler conservatively assigns one SPE to each off-loaded task. Upon a departure, the scheduler monitors the degree of task-level parallelism exposed by each MPI process, i.e. how many discrete tasks were off-loaded to SPEs while the departing task was executing. This number reflects the history of SPE utilization from task-level parallelism and is used to switch from the EDTLP scheduling policy to a hybrid EDTLP-LLP scheduling policy. The scheduler monitors the number of SPEs that execute tasks over epochs of 100 off-loads. If the observed SPE utilization is over 50% the scheduler maintains the most recently selected scheduling policy (EDTLP or EDTLP-LLP). If the observed SPE utilization falls under 50% and the scheduler uses EDTLP, it switches to EDTLP-LLP by loading parallelized versions of the loops in the local storages of SPEs and performing loop distribution. To switch between different parallel execution models at runtime, the runtime system uses code versioning. It maintains three versions of the code of each task. One version is used for execution on the PPE. The second version is used for execution on an SPE from start to finish, using SIMDization to exploit the vector execution units of the SPE. The third version is used for distribution of the loop enclosed by the task between more than one SPEs. The use of code 47

64 versioning increases code management overhead, as SPEs may need to load different versions of the code of each off-loaded task at runtime. On the other hand, code versioning obviates the need for conditionals that would be used in a monolithic version of the code. These conditionals are expensive on SPEs, which lack branch prediction capabilities. Our experimental analysis indicates that overlaying code versions on the SPEs via code transfers ends up being slightly more efficient than using monolithic code with conditionals. This happens because of the overhead and frequency of the conditionals in the monolithic version of the SPE code, but also because the code overlays leave more space available in the local storage of SPEs for data caching and buffering to overlap computation and communication [20]. We compare MGPS to EDTLP and two static hybrid (EDTLP-LLP) schedulers, using 2 SPEs per loop and 4 SPEs per loop respectively. Figure 5.6 shows the execution times of MGPS, EDTLP-LLP and EDTLP with various RAxML workloads. The x-axis shows the number of bootstraps, while the y-axis shows execution time. We observe benefits from using MGPS for up to 28 bootstraps. Beyond 28 bootstraps, MGPS converges to EDTLP and both are increasingly faster than static EDTLP-LLP execution, as the number of bootstraps increases. A clear disadvantage of MGPS is that the time needed for any adaptation decision depends on the total number of off-loading requests, which in turn is inherently application-dependent and input-dependent. If the off-loading requests from different processes are spaced apart, there may be extended idle periods on SPEs, before adaptation takes place. Another disadvantage of MGPS is the dependency of its dynamic scheduling policy on the initial configuration used to execute the application. In RAxML, MGPS converges to the best execution strategy only if the application begins by oversubscribing the PPE and exposing the maximum degree of task-level parallelism to the runtime system. This strategy is unlikely to converge to the best scheduling policy in other applications, where task-level parallelism is limited and data parallelism is more dominant. In this case, MGPS would have to commence its optimization process from a different program configuration favoring data-level rather than task-level parallelism. We address the aforementioned shortcomings via a sampling-based MGPS algorithm (S-MGPS), which we 48

introduce in the next section.

    Figure 5.6: MGPS, EDTLP and static EDTLP-LLP. Input file: 42 SC. Number of ML trees created: (a) 1-16, (b)

5.4 S-MGPS

We begin this section by presenting a motivating example to show why controlling concurrency on the Cell is useful, even if SPEs are seemingly fully utilized. This example motivates the introduction of a sampling-based algorithm that explores the space of program and system

configurations that utilize all SPEs, under different distributions of SPEs between concurrently executing tasks and parallel loops. We present S-MGPS and evaluate it using RAxML and PBPI.

Motivating Example

Increasing the degree of task parallelism on Cell comes at a cost, namely increasing contention between MPI processes that time-share the PPE. Pairs of processes that execute in parallel on the PPE suffer from contention for shared resources, a well-known problem of simultaneous multithreaded processors. Furthermore, with more processes, context switching overhead and the lack of co-scheduling of SPE threads and the PPE threads from which the SPE threads originate may harm performance. On the other hand, while loop-level parallelization can ameliorate PPE contention, its performance benefit depends on the granularity and locality properties of parallel loops. Figure 5.7 shows the efficiency of loop-level parallelism in RAxML when the input data set is relatively small. The input data set in this example (25 SC) has 25 organisms, each of them represented by a DNA sequence of 500 nucleotides. In this experiment, RAxML is executed multiple times with a single worker process and a variable number of SPEs used for LLP. The best execution time is achieved with 5 SPEs. The behavior illustrated in Figure 5.7 is caused by several factors, including the granularity of loops relative to the overhead of PPE-SPE communication, and load imbalance (discussed in Section 5.2.2). By using two dimensions of parallelism to execute an application, the runtime system can control both PPE contention and loop-level parallelization overhead. Figure 5.8 illustrates an example in which multi-grain parallel executions outperform one-dimensional parallel executions in RAxML, for any number of bootstraps. In this example, RAxML is executed with three static parallelization schemes, using 8 MPI processes and 1 SPE per process, 4 MPI processes and 2 SPEs per process, or 2 MPI processes and 4 SPEs per process, respectively. The input data set is 25 SC.

Using this data set, RAxML performs the best with a multi-level parallelization model when 4 MPI processes are simultaneously executed on the PPE and each of them uses 2 SPEs for loop-level parallelization.

    Figure 5.7: Execution time of RAxML with a variable number of SPE threads. The input dataset is 25 SC.

    Figure 5.8: Execution times of RAxML, with various static multi-grain scheduling strategies (8 worker processes with 1 SPE per off-loaded task, 4 worker processes with 2 SPEs per off-loaded task, 2 worker processes with 4 SPEs per off-loaded task). The input dataset is 25 SC.

Sampling-Based Scheduler for Multi-grain Parallelism

The S-MGPS scheduler automatically determines the best parallelization scheme for a specific workload, by using a sampling period. During the sampling period, S-MGPS performs a search

68 of program configurations along the available dimensions of parallelism. The search starts with a single MPI process and during the first step S-MGPS determines the optimal number of SPEs that should be used by a single MPI process. The search is implemented by sampling execution phases of the MPI process with different degrees of loop-level parallelism. Phases represent code that is executed repeatedly in an application and dominates execution time. In case of RAxML and PBPI, phases are the off-loaded tasks. Although we identify phases manually in our execution environment, the selection process for phases is trivial and can be automated in a compiler. Furthermore, parallel applications almost always exhibit a very strong runtime periodicity in their execution patterns, which makes the process of isolating the dominant execution phases straightforward. Once the first sampling step of S-MGPS is completed, the search continues by sampling execution intervals with every feasible combination of task-level and loop-level parallelism. In the second phase of the search, the degree of loop-level parallelism never exceeds the optimal value determined by the first sampling step. For each execution interval, the scheduler uses execution time of phases as a criterion for selecting the optimal dimension(s) and granularity of parallelism per dimension. S-MGPS uses a performance-driven mechanism to rightsize parallelism on Cell, as opposed to the utilization-driven mechanism used in MGPS. Figure 5.9 ilustrates the steps of the sampling phase when 2 MPI processes are executed on the PPE. This process can be performed for any number of MPI processes that can be executed on a single Cell node. For each MPI process, the runtime system uses a variable number of SPEs, ranging from 1 up to the optimal number of SPEs determined by the first phase of sampling. The purpose of the sampling period is to determine the configuration of parallelism that maximizes efficiency. We define a throughput metric W as: 52

    W = C / T                                                        (5.1)

where C is the number of completed tasks and T is execution time. Note that a task is defined as a function off-loaded on SPEs, therefore C captures application- and input-dependent behavior. S-MGPS computes C by counting the number of task off-loads. This metric works reasonably well, assuming that tasks of the same type (i.e., the same function or chunk of an expensive computational loop, off-loaded multiple times on an SPE) have approximately the same execution time. This is indeed the case in the applications that we studied. The metric can be easily extended so that each task is weighed with its execution time relative to the execution time of other tasks, to account for unbalanced task execution times. We do not explore this option further in this thesis.

    Figure 5.9: The sampling phase of S-MGPS. Samples are taken from four execution intervals ((a) through (d)), during which the code performs identical operations. For each sample, each MPI process uses a variable number of SPEs to parallelize its enclosed loops.
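The selection step that follows the sampling period can be summarized with the short sketch below; the struct layout, the hard-coded sample values, and the driver are illustrative assumptions for this example, not the actual S-MGPS implementation.

    #include <stdio.h>

    struct config {
        int    deg_tlp;    /* number of MPI processes (task-level parallelism)   */
        int    deg_llp;    /* SPEs per process used for loop-level parallelism   */
        long   completed;  /* C: off-loaded tasks completed during the sample    */
        double seconds;    /* T: duration of the sampling interval               */
    };

    static double throughput(const struct config *c)
    {
        return (double)c->completed / c->seconds;   /* W = C / T */
    }

    int main(void)
    {
        /* Hypothetical sampling results (not measured values from the thesis). */
        struct config samples[] = {
            { 8, 1, 120000, 10.0 },
            { 4, 2, 131000, 10.0 },
            { 2, 4,  90000, 10.0 },
        };
        int n = sizeof samples / sizeof samples[0], best = 0;

        /* Keep the configuration with the highest throughput W. */
        for (int i = 1; i < n; i++)
            if (throughput(&samples[i]) > throughput(&samples[best]))
                best = i;

        printf("selected deg(TLP)=%d, deg(LLP)=%d (W=%.0f tasks/s)\n",
               samples[best].deg_tlp, samples[best].deg_llp,
               throughput(&samples[best]));
        return 0;
    }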

70 S-MGPS calculates efficiency for every sampled configuration and selects the configuration with the maximum efficiency for the rest of the execution. In Table 5.4 we represent partial results of the sampling phase in RAxML for different input datasets. In this example, the degree of task-level parallelism sampled is 8, 4 and 2, while the degree of loop-level parallelism sampled is 1, 2 and 4. In the case of RAxML we set a single sampling phase to be time necessary for all active worker processes to finish a single bootstrap. Therefore, in the case of RAxML in Table 5.4, the number of bootstraps and the execution time differ across sampling phases: when the number of active workers is 8, the sampling phase will contain 8 bootstraps, when the number of active workers is 4 the sampling phase will contain 4 bootstraps, etc. Nevertheless, the throughput (W ) remains invariant across different sampling phases and always represents the efficiency of a certain configuration, i.e. amount of work done per second. Results presented in Table 5.4 confirm that S-MGPS converges to the optimal configurations (4x2 and 8x1) for the input files 25 SC and 42 SC. Dataset deg(tlp) # bootstr. per # off-loaded phase W deg (LLP) sampling phase tasks duration time 42 SC 8x1 8 2,526, s 60, SC 4x2 4 1,263, s 60, SC 2x , s 43, SC 8x1 8 1,261, s 76, SC 4x , s 76, SC 2x , s 53,998 Table 5.4: Efficiency of different program configurations with two data sets in RAxML. The best configuration for 42 SC input is deg(tlp)=8, deg(llp)=1. The best configuration for 25 SC is deg(tlp)=4, deg(llp)=2. deg() corresponds the degree of a given dimension of parallelism (LLP or TLP). Since the scheduler performs an exhaustive search, for the 25 SC input, the total number of bootstraps required for the sampling period on Cell is 17, for up to 8 MPI processes and 1 to 5 SPEs used per MPI process for loop-level parallelization. The upper bound of 5 SPEs per loop is determined by the first step of the sampling period. Assuming that performance is optimized if the maximum number of SPEs of the processor are involved in parallelization, the 54

feasible configurations to sample are constrained by deg(tlp) × deg(llp) = 8, for a single Cell with 8 SPEs. Under this constraint, the number of samples needed by S-MGPS on Cell drops to 3. Unfortunately, when considering only configurations that use all SPEs, the scheduler may omit a configuration that does not use all SPEs but still performs better than the best scheme that uses all processor cores. In principle, this situation may occur in certain non-scalable codes or code phases. To address such cases, we recommend the use of exhaustive search in S-MGPS, given that the total number of feasible configurations of SPEs on a Cell is manageable and small compared to the number of tasks and the number of instances of each task executed in real applications. This assumption may need to be revisited in the future for large-scale systems with many cores, and exhaustive search may need to be replaced by heuristics such as hill climbing or simulated annealing.

In Table 5.5 we compare the performance of S-MGPS to the static scheduling policies with both one-dimensional (TLP) and multi-grain (TLP-LLP) parallelism on Cell, using RAxML. For a small number of bootstraps, S-MGPS underperforms the best static scheduling scheme by 10%. The reason is that S-MGPS expends a significant percentage of execution time in the sampling period, while executing the program in mostly suboptimal configurations. As the number of bootstraps increases, S-MGPS comes closer to the performance of the best static scheduling scheme (within 3% to 5%).

                   deg(tlp)=8,   deg(tlp)=4,   deg(tlp)=2,
                   deg(llp)=1    deg(llp)=2    deg(llp)=4    S-MGPS
    32 boots.         60s           57s           80s          63s
    64 boots.        117s          112s          161s         118s
    128 boots.       231s          221s          323s         227s

    Table 5.5: RAxML comparison between S-MGPS and static scheduling schemes, illustrating the convergence overhead of S-MGPS.

To map PBPI to Cell, we used a hybrid parallelization approach where a fixed number of MPI processes is multiplexed on the PPE and multiple SPEs are used for loop-level parallelization. The performance of the parallelized off-loaded code in PBPI is influenced by the same

72 Execution time (s) MPI process 2 MPI processes 4 MPI processes 8 MPI processes Number of SPEs Figure 5.10: PBPI executed with different levels of TLP and LLP parallelism: deg(tlp)=1-4, deg(llp)=1 16 factors as in RAxML: granularity of the off-loaded code, PPE-SPE communication, and load imbalance. In Figure 5.10 we present the performance of PBPI when a variable number of SPEs is used to execute the parallelized off-loaded code. The input file we used in this experiment is 107 SC, including 107 organisms, each represented by a DNA sequence of 1,000 nucleotides. We run PBPI with one Markov chain for 200,000 generations. Figure 5.10 contains four executions of PBPI with 1, 2, 4 and 8 MPI processes with 1 16, 1 8, 1 4 and 1 2 SPEs used per MPI process respectively. In all experiments we use a single BladeCenter with two Cell BE processors (total of 16 SPEs). In the experiments with 1 and 2 MPI processes, the off-loaded code scales successfully only up to a certain number of SPEs, which is always smaller than the number of total available SPEs. Furthermore, the best performance in these two cases is reached when the number of SPEs used for parallelization is smaller than the total number of available SPEs. The optimal number of SPEs in general depends on the input data set and on the outermost parallelization and data decomposition scheme of PBPI. The best performance for the specific dataset is reached by using 4 MPI processes, spread across 2 Cell BEs, with each process using 4 SPEs on one Cell BE.This optimal operating point shifts with different data set sizes. The fixed virtual processor topology and data decomposition method used in PBPI prevents 56

73 dynamic scheduling of MPI processes at runtime without excessive overhead. We have experimented with the option of dynamically changing the number of active MPI processes via a gang scheduling scheme, which keeps the total number of active MPI processes constant, but co-schedules MPI processes in gangs of size 1, 2, 4, or 8 on the PPE and uses 8, 4, 2, or 1 SPE(s) per MPI process per gang respectively, for the execution of parallel loops. This scheme also suffered from system overhead, due to process control and context switching on the SPEs. Pending better solutions for adaptively controlling the number of processes in MPI, we evaluated S-MGPS in several scenarios where the number of MPI processes remains fixed. Using S-MGPS we were able to determine the optimal degree of loop-level parallelism, for any given degree of task-level parallelism (i.e. initial number of MPI processes) in PBPI. Being able to pinpoint the optimal SPE configuration for LLP is still important since different loop parallelization strategies can result in a significant difference in execution time. For example, the naïve parallelization strategy, where all available SPEs are used for parallelization of off-loaded loops, can result in up to 21% performance degradation (see Figure 5.10). Table 5.6 shows a comparison of execution times when S-MGPS is used and when different static parallelization schemes are used. S-MGPS performs within 2% of the optimal static parallelization scheme. S-MGPS also performs up to 20% better than the naïve parallelization scheme where all available SPEs are used for LLP (see Table 5.6(b)). 5.5 Chapter Summary In this chapter we investigated policies and mechanisms pertaining to scheduling multigrain parallelism on the Cell Broadband Engine. We proposed an event-driven task scheduler, striving for higher utilization of SPEs via oversubscribing the PPE. We have explored the conditions under which loop-level parallelism within off-loaded code can be used. We have also proposed a comprehensive scheduling policy for combining task-level and loop-level parallelism autonomically within MPI code, in response to workload fluctuation. Using a bio-informatics code with 57

Table 5.6: PBPI comparison between S-MGPS and static scheduling schemes: (a) deg(tlp)=1, deg(llp)=1-16; (b) deg(tlp)=2, deg(llp)=1-8; (c) deg(tlp)=4, deg(llp)=1-4; (d) deg(tlp)=8, deg(llp)=1-2. (For each configuration, the table lists the execution time of every static deg(llp) setting and the execution time achieved by S-MGPS.)

5.5 Chapter Summary

In this chapter we investigated policies and mechanisms pertaining to scheduling multigrain parallelism on the Cell Broadband Engine. We proposed an event-driven task scheduler, striving for higher utilization of SPEs via oversubscribing the PPE. We have explored the conditions under which loop-level parallelism within off-loaded code can be used. We have also proposed a comprehensive scheduling policy for combining task-level and loop-level parallelism autonomically within MPI code, in response to workload fluctuation. Using a bio-informatics code with inherent multigrain parallelism as a case study, we have shown that our user-level scheduling policies outperform the native OS scheduler by up to a factor of 2.7.

Our MGPS scheduler proves to be responsive to small and large degrees of task-level and data-level parallelism, at both fine and coarse levels of granularity. This kind of parallelism is commonly found in optimization problems where many workers are spawned to search a very large space of solutions using a heuristic. RAxML is representative of these applications. MGPS is also appropriate for adaptive and irregular applications such as adaptive mesh refinement, where the application has task-level parallelism with variable granularity (because of load imbalance incurred while meshing subdomains with different structural properties) and, in some implementations, a statically unpredictable degree of task-level parallelism (because of non-deterministic dynamic load balancing which may be employed to improve execution time). N-body simulations and ray-tracing are applications that exhibit similar properties and can also benefit from our scheduler. As a final note, we observe that MGPS reverts to the best static scheduling scheme for regular codes with a fixed degree of task-level parallelism, such as blocked linear algebra kernels.

We also investigated the problem of mapping multi-dimensional parallelism on heterogeneous parallel architectures with both conventional and accelerator cores. We proposed a feedback-guided dynamic scheduling scheme, S-MGPS, which rightsizes parallelism on the fly, without a priori knowledge of application-specific information and regardless of the input data set.


Chapter 6

Model of Multi-Grain Parallelism

6.1 Introduction

The migration of parallel programming models to accelerator-based architectures raises many challenges. Accelerators require platform-specific programming interfaces and re-formulation of parallel algorithms to fully exploit the additional hardware. Furthermore, scheduling code on accelerators and orchestrating parallel execution and data transfers between host processors and accelerators is a non-trivial exercise, as discussed in Chapter 5.

Although the S-MGPS scheduler (Section 5.4) is able to accurately determine the most efficient execution configuration of a multi-level parallel application, it requires sampling of many different configurations at runtime. The sampling time grows with the number of accelerators on the chip and with the number of different levels of parallelism available in the application. To pinpoint the most efficient execution configuration without a sampling phase, we develop a model for multi-dimensional parallel computation on heterogeneous multi-core processors. We name the model Model of Multi-Grain Parallelism (MMGP). The model is applicable to any type of accelerator-based architecture, and in Section 6.4 we test the accuracy and usability of MMGP on the multicore Cell architecture.

Figure 6.1: A hardware abstraction of an accelerator-based architecture with two layers of parallelism. Host processing units (HPUs) supply relatively coarse-grain parallel computation across accelerators. Accelerator processing units (APUs) are the main computation engines and may support internally finer grain parallelism. Both HPUs and APUs have local memories (LM) and communicate through shared memory or message passing. Additional layers of parallelism can be expressed hierarchically in a similar fashion.

6.2 Modeling Abstractions

Performance can be dramatically affected by the assignment of tasks to resources on a complex parallel architecture with multiple types of parallel execution vehicles. We intend to create a performance model that captures the important costs of parallel task assignment at multiple levels of granularity, while maintaining simplicity. Additionally, we want our techniques to be independent of both programming models and the underlying hardware. Thus, in this section we identify the abstractions necessary to define a simple, accurate model of parallel computation for accelerator-based architectures.

6.2.1 Hardware Abstraction

Figure 6.1 shows our abstraction for accelerator-based architectures. In this abstraction, each node consists of multiple host processing units (HPUs) and multiple accelerator processing units (APUs). Both the HPUs and APUs have local and shared memory. Multiple HPU-APU nodes form a cluster.

We model the communication cost between i and j, where i and j are HPUs, APUs, and/or HPU-APU nodes, using a variant of the LogP model [35] of point-to-point communication:

C_{i,j} = O_i + L + O_j    (6.1)

where C_{i,j} is the communication cost, O_i and O_j are the overheads of the sender and the receiver respectively, and L is the communication latency.

In this hardware abstraction, we model an HPU, APU, or HPU-APU node as a sequential device with streaming memory accesses. For simplicity, we assume that additional levels of parallelism in HPUs or APUs, such as ILP and SIMD, can be reflected with a parameter that represents computing capacity. We could alternatively express multi-grain parallelism hierarchically, but this complicates the model description without much added value. The assumption of streaming memory accesses allows the model to capture the effects of overlapping communication with computation.
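As a minimal illustration of how the hardware abstraction and Equation 6.1 translate into something computable, the C sketch below describes a link between two processing units by its LogP-style parameters and evaluates the point-to-point cost. The structure fields and the values in main() are hypothetical and chosen only for illustration.

    #include <stdio.h>

    /* LogP-style parameters of a link between two processing units (Eq. 6.1). */
    typedef struct {
        double o_send;   /* sender-side overhead O_i    */
        double o_recv;   /* receiver-side overhead O_j  */
        double latency;  /* interconnect latency L      */
    } link_params;

    /* C_{i,j} = O_i + L + O_j */
    static double comm_cost(const link_params *lp)
    {
        return lp->o_send + lp->latency + lp->o_recv;
    }

    int main(void)
    {
        /* Hypothetical HPU-to-APU link; values are illustrative, in microseconds. */
        link_params hpu_to_apu = { 0.02, 0.02, 0.03 };
        printf("C_ij = %.3f us\n", comm_cost(&hpu_to_apu));
        return 0;
    }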

6.2.2 Application Abstraction

Figure 6.2 provides an illustrative view of the succeeding discussion. We model the workload of a parallel application using a version of the Hierarchical Task Graph (HTG) [52]. An HTG represents multiple levels of concurrency with progressively finer granularity when moving from the outermost to the innermost layers. We use a phased HTG, in which we partition the application into multiple phases of execution and split each phase into nested sub-phases, each modeled as a single, potentially parallel task. Each subtask may incorporate one or more layers of data or sub-task parallelism. The degree of concurrency may vary between tasks and within tasks.

Figure 6.2: Our application abstraction of two parallel tasks. Two tasks are spawned by the main process. Each task exhibits phased, multi-level parallelism of varying granularity. In this chapter, we address the problem of mapping tasks and subtasks to accelerator-based systems.

Mapping a workload with nested parallelism as shown in Figure 6.2 to an accelerator-based multi-core architecture can be challenging. In the general case, any application task of any granularity could map to any type and combination of HPUs and APUs. The solution space under these conditions can be unmanageable. In this work, we confine the solution space by making some assumptions about the application and hardware.

First, we assume that the amount and type of parallelism is known a priori for all phases in the application. In other words, we assume that the application is explicitly parallelized, in a machine-independent fashion. More specifically, we assume that the application exposes all available layers of inherent parallelism to the runtime environment, without however specifying how to map this parallelism to the parallel execution vehicles in hardware. In other words, we assume that the application's parallelism is expressed independently of the number and the layout of processors in the architecture. The parallelism of the application is represented by a phased HTG. The intent of our work is to improve and formalize programming of accelerator-based multicore architectures. We believe it is not unreasonable to assume that those interested in porting code and algorithms to such systems would have detailed knowledge about the inherent parallelism of their application. Furthermore, explicit, processor-independent parallel programming is considered by many as a means to simplify parallel programming models [10].

Second, we prune the number and type of hardware configurations. We assume that hardware configurations consist of a hierarchy of nested resources, even though the actual resources may not be physically nested in the architecture. Each resource is assigned to an arbitrary level of parallelism in the application, and resources are grouped by the level of parallelism they serve. For instance, the Cell Broadband Engine can be considered as 2 HPUs and 8 APUs, where the two HPUs correspond to the two threads of the dual-thread SMT PowerPC core and the APUs correspond to the synergistic (SPE) accelerator cores. HPUs support parallelism of any granularity, whereas APUs support the same or finer, not coarser, granularity. This assumption is reasonable since it faithfully represents all current accelerator architectures, where front-end processors offload computation and data to accelerators. It also simplifies the modeling of both communication and computation.

6.3 Model of Multi-grain Parallelism

This section provides theoretical rigor to our approach. We present MMGP, a model which predicts execution time on accelerator-based system configurations and applications under the assumptions described in the previous section. Readers familiar with point-to-point models of parallel computation may want to skim this section and continue directly to the results of our execution time prediction techniques, discussed in Section 6.4.

We follow a bottom-up approach. We begin by modeling sequential execution on the HPU, with part of the computation off-loaded to a single APU. Next, we incorporate multiple APUs in the model, followed by multiple HPUs. We end up with a general model of execution time, which is not particularly practical. Hence, we reduce the general model to reflect different uses of HPUs and APUs on real systems. More specifically, we specialize the model to capture the scheduling policy of threads on the HPUs and to estimate execution times under different mappings of multi-grain parallelism across HPUs and APUs. Lastly, we describe the methodology we use to apply MMGP to real systems.

Figure 6.3: The sub-phases of a sequential application are readily mapped to HPUs and APUs: (a) an architecture with one HPU and one APU; (b) an application with three phases. In this example, sub-phases 1 and 3 execute on the HPU and sub-phase 2 executes on the APU. HPUs and APUs are assumed to communicate via shared memory.

6.3.1 Modeling sequential execution

As the starting point, we consider the mapping of the program to an accelerator-based architecture that consists of one HPU and one APU, and an application with one phase decomposed into three sub-phases: a prologue and an epilogue running on the HPU, and a main accelerated phase running on the APU, as illustrated in Figure 6.3.

Offloading computation incurs additional communication cost, for loading code and data on the APU and for saving results calculated on the APU. We model each of these communication costs with a latency and an overhead at the end-points, as in Equation 6.1. We assume that the APU's accesses to data during the execution of a procedure are streamed and overlapped with APU computation. This assumption reflects the capability of current streaming architectures, such as the Cell and Merrimac [37], to aggressively overlap memory latency with computation using multiple buffers. Due to overlapped memory latency, communication overhead is assumed to be visible only while loading the code and arguments of a procedure on the APU and while returning the result of a procedure from the APU to the HPU. We combine the communication overhead for offloading the code and arguments of a procedure and signaling the execution of that procedure on the APU in one term (O_s), and the overhead for returning the result of a procedure from the APU to the HPU in another term (O_r).

We can model the execution time for the offloaded sequential execution of sub-phase 2 in Figure 6.3 as:

T_{offload}(w_2) = T_{APU}(w_2) + O_r + O_s    (6.2)

where T_{APU}(w_2) is the time needed to complete sub-phase 2 without additional overhead. Further, we can write the total execution time of all three sub-phases as:

T = T_{HPU}(w_1) + T_{APU}(w_2) + O_r + O_s + T_{HPU}(w_3)    (6.3)

To reduce complexity, we replace T_{HPU}(w_1) + T_{HPU}(w_3) with T_{HPU}, T_{APU}(w_2) with T_{APU}, and O_s + O_r with O_{offload}. Therefore, we can rewrite Equation 6.3 as:

T = T_{HPU} + T_{APU} + O_{offload}    (6.4)

The application model in Figure 6.3 is representative of one of potentially many phases in an application. We further generalize Equation 6.4 for an application with N phases, where each phase i offloads a part of its computation to one APU:

T = \sum_{i=1}^{N} (T_{HPU,i} + T_{APU,i} + O_{offload})    (6.5)
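As a sanity check on the bookkeeping in Equations 6.4 and 6.5, the short C sketch below sums the per-phase HPU time, APU time, and offload overhead of an N-phase application. The per-phase arrays and the values in main() are hypothetical and serve only to show how the sequential-offload model composes.

    #include <stdio.h>

    /* Eq. 6.5: total time of an N-phase application, each phase offloading
       its accelerated part to a single APU. Times are in seconds. */
    static double mmgp_sequential(const double *t_hpu, const double *t_apu,
                                  double o_offload, int n_phases)
    {
        double total = 0.0;
        for (int i = 0; i < n_phases; i++)
            total += t_hpu[i] + t_apu[i] + o_offload;   /* Eq. 6.4 per phase */
        return total;
    }

    int main(void)
    {
        /* Hypothetical profile of a three-phase application. */
        double t_hpu[] = { 0.10, 0.05, 0.20 };
        double t_apu[] = { 1.50, 2.10, 0.90 };
        double o_offload = 0.001;   /* O_s + O_r */
        printf("T = %.3f s\n", mmgp_sequential(t_hpu, t_apu, o_offload, 3));
        return 0;
    }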

6.3.2 Modeling parallel execution on APUs

Each offloaded part of a phase may contain fine-grain parallelism, such as task-level parallelism at the sub-procedural level or data-level parallelism in loops. This parallelism can be exploited by using multiple APUs for the offloaded workload. Figure 6.4 shows the decomposition of execution time when one APU and when two APUs are used. We assume that the code off-loaded to an APU during phase i has a part which can be further parallelized across APUs and a part executed sequentially on the APU. We denote by T_{APU,i}(1,1) the execution time of the further parallelizable part of the APU code during the i-th phase; the first index refers to the use of one HPU thread in the execution. We denote by T_{APU,i}(1,p) the execution time of the same part when p APUs are used to execute it during the i-th phase, and by C_{APU,i} the non-parallelized part of the APU code in phase i. Therefore, we obtain:

T_{APU,i}(1,p) = T_{APU,i}(1,1)/p + C_{APU,i}    (6.6)

Figure 6.4: Parallel APU execution: (a) offloading to one APU; (b) offloading to two APUs. The HPU (leftmost bar in parts a and b) offloads computations to one APU (part a) and two APUs (part b). The single point-to-point transfer of part a is modeled as overhead plus computation time on the APU. For multiple transfers, there is additional overhead (g), but also benefits due to parallelization.

Given that the HPU offloads to APUs sequentially, there exists a latency gap between consecutive offloads to APUs. Similarly, there exists a gap between receiving return values from two consecutive offloaded procedures on the HPU. We denote by g the larger of the two gaps. On a system with p APUs, parallel APU execution will incur an additional overhead as large as p·g. Thus, we can model the execution time of phase i as:

T_i(1,p) = T_{HPU,i} + T_{APU,i}(1,1)/p + C_{APU,i} + O_{offload} + p·g    (6.7)
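Equation 6.7 already explains why the experiments in Chapter 5 found an optimal SPE count below the maximum: the T_{APU,i}(1,1)/p term shrinks with p while the p·g term grows. The C sketch below evaluates Equation 6.7 over a range of p and reports the minimum; the parameter values are hypothetical and chosen only to make the trade-off visible.

    #include <stdio.h>

    /* Eq. 6.7: execution time of phase i with one HPU thread and p APUs. */
    static double phase_time(double t_hpu, double t_apu_11, double c_apu,
                             double o_offload, double gap, int p)
    {
        return t_hpu + t_apu_11 / p + c_apu + o_offload + p * gap;
    }

    int main(void)
    {
        /* Hypothetical per-phase parameters, in seconds. */
        double t_hpu = 0.05, t_apu_11 = 4.0, c_apu = 0.2;
        double o_offload = 0.001, gap = 0.05;
        int best_p = 1;
        double best_t = phase_time(t_hpu, t_apu_11, c_apu, o_offload, gap, 1);

        for (int p = 1; p <= 16; p++) {
            double t = phase_time(t_hpu, t_apu_11, c_apu, o_offload, gap, p);
            printf("p = %2d  T_i = %.3f s\n", p, t);
            if (t < best_t) { best_t = t; best_p = p; }
        }
        printf("best p = %d (T_i = %.3f s)\n", best_p, best_t);
        return 0;
    }

With these illustrative numbers the minimum falls at an intermediate p, not at p = 16, which is the same qualitative behavior observed for PBPI in Figure 5.10.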

6.3.3 Modeling parallel execution on HPUs

An accelerator-based architecture can support parallel HPU execution in several ways, by providing a multi-core HPU, an SMT HPU, or combinations thereof. As a point of reference, we consider an architecture with one SMT HPU, which is representative of the Cell BE. Since the compute-intensive parts of an application are typically offloaded to APUs, the HPUs are expected to be idle for extended intervals. Therefore, multiple threads can be used to reduce idle time on the HPU and to provide more sources of work for the APUs, so that the APUs are better utilized. It is also possible to oversubscribe the HPU with more threads than the number of available hardware contexts, in order to expose more parallelism via offloading to APUs. Figure 6.5 illustrates the execution timeline when two threads share the same HPU and each thread offloads parallelized code to two APUs. We use different shade patterns in the figure to represent the workloads of the two threads.

Figure 6.5: Parallel HPU execution. The HPU (center bar) offloads computations to 4 APUs (2 on the right and 2 on the left). The first thread on the HPU offloads computation to APU1 and APU2, then idles. The second HPU thread is switched in, offloads code to APU3 and APU4, and then idles. APU1 and APU2 complete and return data, followed by APU3 and APU4.
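The offload-then-idle pattern in Figure 6.5 can be mimicked with plain POSIX threads, which is all the following C sketch does: each of two host threads dispatches work to its own pair of simulated accelerators and then blocks until the results come back, so the other host thread can run in the meantime. The offload_to_apu() function only simulates accelerator work with a sleep; it stands in for a real offload interface (e.g., the Cell SDK), which is not shown here.

    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    #define APUS_PER_THREAD 2

    /* Simulated accelerator-side work: in a real Cell program this would be
       an SPE context executing the off-loaded procedure. */
    static void *offload_to_apu(void *arg)
    {
        int apu_id = *(int *)arg;
        usleep(100000);                 /* pretend the APU computes for 100 ms */
        printf("APU %d finished\n", apu_id);
        return NULL;
    }

    /* One HPU thread: offload to its APUs, then idle (block) until they finish,
       letting the other HPU thread use the host core in the meantime. */
    static void *hpu_thread(void *arg)
    {
        int tid = *(int *)arg;
        pthread_t apu[APUS_PER_THREAD];
        int ids[APUS_PER_THREAD];

        for (int i = 0; i < APUS_PER_THREAD; i++) {
            ids[i] = tid * APUS_PER_THREAD + i;
            pthread_create(&apu[i], NULL, offload_to_apu, &ids[i]);
        }
        for (int i = 0; i < APUS_PER_THREAD; i++)
            pthread_join(apu[i], NULL);      /* the HPU thread idles here */
        printf("HPU thread %d collected its results\n", tid);
        return NULL;
    }

    int main(void)
    {
        pthread_t hpu[2];
        int tids[2] = { 0, 1 };
        for (int i = 0; i < 2; i++)
            pthread_create(&hpu[i], NULL, hpu_thread, &tids[i]);
        for (int i = 0; i < 2; i++)
            pthread_join(hpu[i], NULL);
        return 0;
    }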

For m concurrent HPU threads, where each thread uses p APUs for distributing a single APU task, the execution time of a single off-loading phase can be represented as:

T_i^k(m,p) = T_{HPU,i}^k(m,p) + T_{APU,i}^k(m,p) + O_{offload} + p·g    (6.8)

where T_i^k(m,p) is the completion time of the k-th HPU thread during the i-th phase.

Modeling the APU time. Similarly to Equation 6.6, we can write the APU time of the k-th thread in phase i in Equation 6.8 as:

T_{APU,i}^k(m,p) = T_{APU,i}(m,1)/p + C_{APU,i}    (6.9)

Different parallel implementations may result in different T_{APU,i}(m,1) terms and a different number of offloading phases. For example, the implementation could parallelize each phase among m HPU threads and then offload the work of each HPU thread to p APUs, resulting in the same number of offloading phases and a reduced APU time during each phase, i.e., T_{APU,i}(m,1) = T_{APU,i}(1,1)/m. As another example, the HPU threads can be used to execute multiple identical tasks, resulting in a reduced number of offloading phases (i.e., N/m, where N is the number of offloading phases when there is only one HPU thread) and the same APU time in each phase, i.e., T_{APU,i}(m,1) = T_{APU,i}(1,1).

Modeling the HPU time. The execution time of each HPU thread is affected by three factors:

1. Contention between HPU threads for shared resources.
2. Context switch overhead related to resource scheduling.
3. Global synchronization between dependent HPU threads.

Considering all three factors, we can model the execution time of an HPU thread in phase i as:

T_{HPU,i}^k(m,p) = α_m·T_{HPU,i}(1,p) + T_{CSW} + O_{COL}    (6.10)

In this equation T_{CSW} is the context switching time on the HPU, and O_{COL} is the time needed for collective communication. The parameter α_m is introduced to account for contention between threads that share resources on the HPU. On SMT and CMP HPUs, such resources typically include one or more levels of the on-chip cache memory. On SMT HPUs in particular, shared resources also include TLBs, branch predictors and instruction slots in the pipeline. Contention between threads often introduces artificial load imbalance, due to occasionally unfair hardware policies for allocating resources between threads.

Synthesis. Combining Equations (6.8)-(6.10) and summing over all phases, we can write the execution time predicted by MMGP as:

T(m,p) = α_m·T_{HPU}(1,1) + T_{APU}(1,1)/(m·p) + C_{APU} + N·(O_{offload} + T_{CSW} + O_{COL} + p·g)    (6.11)

Due to limited hardware resources (i.e. the number of HPUs and APUs), we further constrain this equation to m·p ≤ N_{APU}, where N_{APU} is the number of available APUs. As described later in this chapter, we can either measure or approximate all parameters in Equation 6.11 from micro-benchmarks and profiles of sequential runs of the program.

6.3.4 Using MMGP

Given a parallel application, MMGP can be applied using the following process:

1. Calculate parameters including O_{offload}, α_m, T_{CSW} and O_{COL}, using micro-benchmarks for the target platform.

2. Profile a short run of the sequential execution with off-loading to a single APU, to estimate T_{HPU}(1), g, T_{APU}(1,1) and C_{APU}.

3. Solve a special case of Equation 6.11 (e.g. Equation 6.7) to find the optimal mapping between the application's concurrency and the HPUs and APUs available on the target platform.
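Step 3 amounts to a very small search: with the parameters in hand, Equation 6.11 can be evaluated for every feasible pair (m, p) with m·p ≤ N_APU and the minimum selected directly. The C sketch below does exactly that, assuming the form of Equation 6.11 as written above; the parameter values in main() are hypothetical placeholders, and α_m is treated as a single constant (the mean measured value) for simplicity.

    #include <stdio.h>

    /* MMGP parameters (Eq. 6.11), supplied by micro-benchmarks and a
       profiled sequential run. Times are in seconds. */
    typedef struct {
        double alpha_m;     /* HPU contention factor (mean value)   */
        double t_hpu;       /* T_HPU(1,1)  */
        double t_apu;       /* T_APU(1,1)  */
        double c_apu;       /* C_APU       */
        double o_offload;   /* O_offload   */
        double t_csw;       /* T_CSW       */
        double o_col;       /* O_COL       */
        double gap;         /* g           */
        int    n_phases;    /* N           */
    } mmgp_params;

    /* Eq. 6.11: predicted time for m HPU threads, each using p APUs. */
    static double mmgp_time(const mmgp_params *q, int m, int p)
    {
        return q->alpha_m * q->t_hpu
             + q->t_apu / (m * p)
             + q->c_apu
             + q->n_phases * (q->o_offload + q->t_csw + q->o_col + p * q->gap);
    }

    int main(void)
    {
        /* Hypothetical parameter set. */
        mmgp_params q = { 1.3, 1.0, 400.0, 2.0, 1e-3, 2e-6, 1e-4, 5e-3, 1000 };
        int n_apu = 16, best_m = 1, best_p = 1;
        double best_t = mmgp_time(&q, 1, 1);

        for (int m = 1; m <= n_apu; m++)
            for (int p = 1; m * p <= n_apu; p++) {   /* constraint m*p <= N_APU */
                double t = mmgp_time(&q, m, p);
                if (t < best_t) { best_t = t; best_m = m; best_p = p; }
            }
        printf("best (m, p) = (%d, %d), predicted T = %.2f s\n",
               best_m, best_p, best_t);
        return 0;
    }

Because the feasible space contains only a few dozen (m, p) pairs on a machine such as the Cell, this exhaustive evaluation is essentially free compared with the runtime sampling that S-MGPS performs.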

6.3.5 MMGP Extensions

We note that the concepts and assumptions mentioned in this section do not preclude further specialization of MMGP for higher accuracy. For example, in Section 6.3.1 we assume that computation and data communication overlap. This assumption reflects the fact that streaming processors can typically overlap memory access latency completely with computation. For non-overlapped memory accesses, we can employ a DMA model as a specialization of the overhead factors in MMGP. Also, in Sections 6.3.2 and 6.3.3 we assume only two levels of parallelism. MMGP is easily extensible to additional levels, but the terms of the equations grow quickly without conceptual additions. Furthermore, MMGP can be easily extended to reflect specific scheduling policies for threads on HPUs and APUs, as well as load imbalance in the distribution of tasks between HPUs and APUs. To illustrate the usefulness of our techniques we apply them to a real system. We next present results from applying MMGP to Cell.

6.4 Experimental Validation and Results

We use MMGP to derive multi-grain parallelization schemes for two bioinformatics applications, RAxML and PBPI, described in Chapter 3, on a shared-memory dual Cell blade, IBM QS20. Although we are using only two applications in our experimental evaluation, we should point out that these are complete applications used for real-world biological data analyses, and that they are fully optimized for the Cell BE using an arsenal of optimizations, including vectorization, loop unrolling, double buffering, if-conversion and dynamic scheduling. Furthermore, these applications have inherent multi-grain concurrency and non-trivial scaling properties in their phases, therefore scheduling them optimally on Cell is a challenging exercise for MMGP. Lastly, in the absence of comprehensive suites of benchmarks (such as NAS or SPEC HPC) ported to Cell, optimized, and made available to the community by experts, we opted to use PBPI and RAxML, codes on which we could verify that enough effort has been invested towards Cell-specific parallelization and optimization.

6.4.1 MMGP Parameter approximation

MMGP has eight free parameters: T_{HPU}, T_{APU}, C_{APU}, O_{offload}, g, T_{CSW}, O_{COL} and α_m. We estimate four of these parameters using micro-benchmarks.

α_m captures contention between processes or threads running on the PPE. This contention depends on the scheduling algorithm on the PPE. We estimate α_m under an event-driven scheduling model which oversubscribes the PPE with more processes than the number of hardware threads supported for simultaneous execution on the PPE, and which switches between processes upon each off-loading event on the PPE [19]. To estimate α_m, we use a parallel micro-benchmark that computes the product of two M × M square matrices of double-precision floating point elements. Matrix-matrix multiplication involves O(n^3) computation and O(n^2) data transfers, thus stressing the impact of sharing execution resources and the L1 and L2 caches between processes on the PPE. We used several different matrix sizes to exercise different levels of pressure on the thread-shared caches of the PPE. In the MMGP model, we use the mean of α_m obtained from these experiments.
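A rough way to reproduce this kind of contention measurement is sketched below in C: the same matrix-multiply kernel is timed once in isolation and once while a second process shares the host processor, and α_m is taken as the ratio of contended to solo time. The kernel, the matrix size, and the use of fork() are illustrative choices, not the exact micro-benchmark used in this work.

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <time.h>
    #include <unistd.h>

    #define M 256   /* illustrative matrix size */

    static double a[M][M], b[M][M], c[M][M];

    static void matmul(void)
    {
        for (int i = 0; i < M; i++)
            for (int j = 0; j < M; j++) {
                double s = 0.0;
                for (int k = 0; k < M; k++)
                    s += a[i][k] * b[k][j];
                c[i][j] = s;
            }
    }

    static double timed_matmul(void)
    {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        matmul();
        clock_gettime(CLOCK_MONOTONIC, &t1);
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    }

    int main(void)
    {
        double solo = timed_matmul();        /* one process on the host core */

        pid_t pid = fork();                  /* second process creates contention */
        if (pid == 0) { matmul(); _exit(0); }
        double contended = timed_matmul();   /* timed while the child also runs */
        waitpid(pid, NULL, 0);

        printf("alpha_m estimate: %.2f\n", contended / solo);
        return 0;
    }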

PPE-SPE communication is optimally implemented through DMAs on Cell. We devised a ping-pong micro-benchmark using DMAs to send a single word from the PPE to one SPE and back. We measured the PPE-SPE-PPE round-trip communication overhead (O_{offload}) to be 70 ns. To measure the overhead caused by various collective communications we used mpptest [55] on the PPE.

Using a micro-benchmark that repeatedly executes the sched_yield() system call, we estimate the overhead caused by context switching (T_{CSW}) on the PPE to be 2 µs. This is a conservative upper bound for the context switching overhead, since it includes some user-level library overhead.
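The context-switch micro-benchmark can be approximated with a few lines of C: time a tight loop of sched_yield() calls and divide by the iteration count. This is only a sketch of the measurement idea; with a single runnable thread it mostly captures system-call and library cost, which is one reason the 2 µs figure above is treated as a conservative upper bound.

    #include <sched.h>
    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        const long iters = 1000000;
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < iters; i++)
            sched_yield();              /* request a context switch */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double total = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        /* With only one runnable thread this mostly measures call overhead;
           running two copies of this program on the same core makes each
           sched_yield() trigger an actual switch. */
        printf("avg cost per sched_yield(): %.2f us\n", total / iters * 1e6);
        return 0;
    }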

T_{HPU}, T_{APU}, C_{APU} and the gap g between consecutive DMAs on the PPE are application-dependent and cannot be approximated easily with a micro-benchmark. To estimate these parameters, we use a profile of a sequential run of the code, with tasks off-loaded to one SPE, using timing instructions inserted into the applications at specific locations. To estimate T_{HPU} we measure the time that the applications spend on the HPU. To estimate T_{APU} and C_{APU} we measure the time that the applications spend on the accelerators, in large computational loops which can be parallelized (T_{APU}) and in the sequential accelerator code outside of the large loops (C_{APU}). To estimate g, we measure the time intervals between consecutive task off-loads and task completions.

6.4.2 Case Study I: Using MMGP to parallelize PBPI

PBPI with One Dimension of Parallelism

We compare the PBPI execution times predicted by MMGP to the actual execution times obtained on real hardware, using various degrees of PPE and SPE parallelism, i.e. the equivalents of HPU and APU parallelism on Cell. These experiments illustrate the accuracy of MMGP on a sample of the feasible program configurations. The sample includes one-dimensional decompositions of the program between PPE threads, with simultaneous off-loading of code to one SPE from each PPE thread; one-dimensional decompositions of the program between SPE threads, where the execution of tasks on the PPE is sequential and each task off-loads code which is data-parallel across SPEs; and two-dimensional decompositions of the program, where multiple tasks run on the PPE threads concurrently and each task off-loads code which is data-parallel across SPEs. In all cases, the SPE code is SIMDized in the innermost loops, to exploit the vector units of the SPEs. We believe that this sample of program configurations is representative of what a user would reasonably experiment with while trying to optimize the codes on the Cell.

For these experiments, we used the arch107 L10000 input data set. This data set consists of 107 sequences, each with 10,000 characters. We run PBPI with one Markov chain for generations. Using the time base register on the PPE and the decrementer register on one SPE, we obtained the following model parameters for PBPI: T_{HPU} = 1.3s, T_{APU} = 370s, g = 0.8s and O = 1.72s.

Figure 6.6: MMGP predictions and actual execution times of PBPI, when the code uses one dimension of PPE (HPU) parallelism.

Figure 6.6 compares MMGP and actual execution times for PBPI, when PBPI only exploits one-dimensional PPE (HPU) parallelism, in which each PPE thread uses one SPE for off-loading. We execute the code with up to 16 MPI processes, which off-load code to up to 16 SPEs on two Cell BEs. Referring to Equation 6.11, we set p = 1 and vary the value of m from 1 to 8. The X-axis shows the number of processes running on the PPE (i.e. HPU parallelism), and the Y-axis shows the predicted and measured execution times. The maximum prediction error of MMGP is 5%. The arithmetic mean of the error is 2.3% and the standard deviation is 1.4.

Figure 6.7 illustrates predicted and actual execution times when PBPI uses one dimension


Design of Parallel Algorithms. Models of Parallel Computation + Design of Parallel Algorithms Models of Parallel Computation + Chapter Overview: Algorithms and Concurrency n Introduction to Parallel Algorithms n Tasks and Decomposition n Processes and Mapping n Processes

More information

Module 18: "TLP on Chip: HT/SMT and CMP" Lecture 39: "Simultaneous Multithreading and Chip-multiprocessing" TLP on Chip: HT/SMT and CMP SMT

Module 18: TLP on Chip: HT/SMT and CMP Lecture 39: Simultaneous Multithreading and Chip-multiprocessing TLP on Chip: HT/SMT and CMP SMT TLP on Chip: HT/SMT and CMP SMT Multi-threading Problems of SMT CMP Why CMP? Moore s law Power consumption? Clustered arch. ABCs of CMP Shared cache design Hierarchical MP file:///e /parallel_com_arch/lecture39/39_1.htm[6/13/2012

More information

A Transport Kernel on the Cell Broadband Engine

A Transport Kernel on the Cell Broadband Engine A Transport Kernel on the Cell Broadband Engine Paul Henning Los Alamos National Laboratory LA-UR 06-7280 Cell Chip Overview Cell Broadband Engine * (Cell BE) Developed under Sony-Toshiba-IBM efforts Current

More information

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) Limits to ILP Conflicting studies of amount of ILP Benchmarks» vectorized Fortran FP vs. integer

More information

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University A.R. Hurson Computer Science and Engineering The Pennsylvania State University 1 Large-scale multiprocessor systems have long held the promise of substantially higher performance than traditional uniprocessor

More information

ad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors

ad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors ad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors Weifeng Liu and Brian Vinter Niels Bohr Institute University of Copenhagen Denmark {weifeng, vinter}@nbi.dk March 1, 2014 Weifeng

More information

Phylogenetic Inference on Cell Processors

Phylogenetic Inference on Cell Processors Department of Computer Science University of Aarhus Denmark Phylogenetic Inference on Cell Processors Master s Thesis Martin Simonsen - 20030596 September 11, 2008 Supervisor: Christian Nørgaard Storm

More information

Lecture Topics. Announcements. Today: Advanced Scheduling (Stallings, chapter ) Next: Deadlock (Stallings, chapter

Lecture Topics. Announcements. Today: Advanced Scheduling (Stallings, chapter ) Next: Deadlock (Stallings, chapter Lecture Topics Today: Advanced Scheduling (Stallings, chapter 10.1-10.4) Next: Deadlock (Stallings, chapter 6.1-6.6) 1 Announcements Exam #2 returned today Self-Study Exercise #10 Project #8 (due 11/16)

More information

SAS Enterprise Miner Performance on IBM System p 570. Jan, Hsian-Fen Tsao Brian Porter Harry Seifert. IBM Corporation

SAS Enterprise Miner Performance on IBM System p 570. Jan, Hsian-Fen Tsao Brian Porter Harry Seifert. IBM Corporation SAS Enterprise Miner Performance on IBM System p 570 Jan, 2008 Hsian-Fen Tsao Brian Porter Harry Seifert IBM Corporation Copyright IBM Corporation, 2008. All Rights Reserved. TABLE OF CONTENTS ABSTRACT...3

More information

William Stallings Computer Organization and Architecture 8 th Edition. Chapter 18 Multicore Computers

William Stallings Computer Organization and Architecture 8 th Edition. Chapter 18 Multicore Computers William Stallings Computer Organization and Architecture 8 th Edition Chapter 18 Multicore Computers Hardware Performance Issues Microprocessors have seen an exponential increase in performance Improved

More information

All About the Cell Processor

All About the Cell Processor All About the Cell H. Peter Hofstee, Ph. D. IBM Systems and Technology Group SCEI/Sony Toshiba IBM Design Center Austin, Texas Acknowledgements Cell is the result of a deep partnership between SCEI/Sony,

More information