Extending OpenMP to Survive the Heterogeneous Multi-Core Era


Eduard Ayguadé · Rosa M. Badia · Pieter Bellens · Daniel Cabrera · Alejandro Duran · Roger Ferrer · Marc Gonzàlez · Francisco Igual · Daniel Jiménez-González · Jesús Labarta · Luis Martinell · Xavier Martorell · Rafael Mayo · Josep M. Pérez · Judit Planas · Enrique S. Quintana-Ortí

Received: 27 April 2010 / Accepted: 27 April 2010
© Springer Science+Business Media, LLC 2010

E. Ayguadé (corresponding author, eduard.ayguade@bsc.es), R. M. Badia, P. Bellens, D. Cabrera, A. Duran, R. Ferrer, M. Gonzàlez, D. Jiménez-González, J. Labarta, L. Martinell, X. Martorell, J. M. Pérez, J. Planas: Barcelona Supercomputing Center (Centro Nacional de Supercomputación, BSC-CNS), Barcelona, Spain. E. Ayguadé, M. Gonzàlez, D. Jiménez-González, J. Labarta, X. Martorell: Depto. de Arquitectura de Computadores, Universitat Politècnica de Catalunya, Barcelona, Spain. R. M. Badia: IIIA, Artificial Intelligence Research Institute, CSIC, Spanish National Research Council, Madrid, Spain. F. Igual, R. Mayo, E. S. Quintana-Ortí: Depto. de Ingeniería y Ciencia de Computadores, Universidad Jaume I (UJI), Castellón, Spain.

Abstract This paper advances the state of the art in programming models for exploiting task-level parallelism on heterogeneous many-core systems, presenting a number of extensions to the OpenMP language inspired by the StarSs programming model. The proposed extensions allow the programmer to easily write portable code for a number of different platforms, relieving him/her from developing the specific code to off-load tasks to the accelerators and to synchronize those tasks. Our results, obtained from the StarSs instantiations for SMPs, the Cell B.E., and GPUs, report reasonable parallel performance. However, the real impact of our approach is the productivity gains it yields for the programmer.

Keywords Parallel computing · Programming models · Runtime systems · Task-level parallelism · Multi-core processors · Hardware accelerators · Heterogeneous computing

1 Introduction

In response to the combined hurdles of power dissipation, large memory latency, and the little instruction-level parallelism left to be exploited, all major hardware manufacturers have adopted the replication of cores on-chip as the mainstream path to deliver higher performance [1]. Today, chips with a few general-purpose, fully-functional cores are available from Intel (2-6 cores), AMD (2-4 cores), or Sun (8 cores), to name a few, and the number of cores is expected to increase with each shrink of the process technology. Chips in the near future will potentially integrate hundreds or thousands of cores. Graphics processors (GPUs) from NVIDIA and AMD/ATI, on the other hand, are already in the many-core era, featuring hundreds of fine-grained stream cores per processor (up to 240 and 320 in their latest designs, respectively). Together with GPUs, hardware accelerators like the heterogeneous IBM/Sony/Toshiba Cell B.E., ClearSpeed ASICs or the FPGAs from multiple vendors are appealing in that, compared with general-purpose multi-core processors, they offer much higher performance-cost and performance-power ratios for certain applications.

Therefore, we envision a future of heterogeneous architectures, equipped with a few coarse-grain general-purpose cores and several accelerators, possibly of a different nature. Applications will be executed on the most appropriate technology, while other parts of the processor may be turned off to decrease the power used by the chip. While the increasing number of transistors on-chip dictated by Moore's Law indicates that many-core processors, with a blend of heterogeneous technologies, will be feasible in the near future, the availability of applications for these architectures, and more specifically of programming models, will really determine the success or failure of these designs. How easy it is to develop programs that efficiently exploit parallelism at all levels (instruction, data, task, and node) in these parallel processors is the key to the future. The heterogeneous nature of these systems, and the existence of multiple separate address spaces, only exacerbates the height of the programmability wall.

The majority of proposals in this current Tower of Babel era assume a host-directed programming and execution model with attached accelerator devices. The bulk of the user's application executes on the host while user-specified code regions are offloaded to the accelerator. In general, the specifics of the different accelerator architectures make programming extremely difficult if one plans to use the vendor-provided SDKs (e.g., libspe for the Cell B.E. or CUDA for NVIDIA GPUs). It would be desirable to retain most of the advantages of using these SDKs, but in a productive and portable manner, avoiding the mix of hardware-specific code (for task offloading, data movement, etc.) with application code. The recent attempt of OpenCL to unify the programming models for architectures based on hardware accelerators tries to ensure portability, low-level access to the hardware, and supposedly high performance. We believe, however, that OpenCL still exposes much of the low-level details, making it cumbersome to use for non-experts.

OpenMP [2] survived the explosion of parallel programming languages of the 90s to become the standard for parallelizing regular applications on shared-memory multiprocessors. While recent additions to OpenMP 3.0 accommodate task parallelism, making it more suitable for irregular codes, OpenMP is not ready for the challenges posed by the new generation of multi-core heterogeneous architectures. Star Superscalar (StarSs) [3] is a promising programming model in this direction that we have used as the starting point to propose a set of extensions to OpenMP. StarSs has its roots in the runtime exploitation of task parallelism, with special emphasis on portability, simplicity and flexibility. Functions that are suitable for parallel execution are annotated as tasks, and their arguments are tagged with their directionality (input, output or both). This information is provided using a simple OpenMP-like annotation of the source code (e.g., pragmas in C/C++) and is used at runtime to build a task dependence graph dynamically. This graph is one of the main elements used by the runtime to schedule tasks as soon as all their dependences are honored and the appropriate resource to execute them is available. For those architectures with local memories, the runtime also takes care of moving the associated data in and out.
The scheduler may also be driven by data locality policies; other target-dependent optimizations of the scheduler can also be incorporated into the general framework of StarSs.

Current instantiations of the StarSs programming model and tools include GRIDSs (for the Grid), CellSs (for the Cell B.E.), SMPSs (for general-purpose multi-core processors) and GPUSs (for platforms with multiple GPUs). We are currently extending it to cover platforms with multiple generic hardware accelerators and FPGAs.

2 The StarSs Extensions to OpenMP

OpenMP has traditionally been employed to exploit loop-based parallelism, present in most regular scientific and engineering applications, on shared-memory multiprocessors. Recently, OpenMP 3.0 has been extended with tasks, or deferrable units of work, to also accommodate irregular applications. In particular, in OpenMP 3.0 the programmer can specify tasks, and later ensure that all the tasks defined up to some point have finished. Tasks are generated in the context of a team of threads, while the parallel construct creates such a team. A task is created when the code reaches the task construct, defined as follows:

#pragma omp task [clause-list]
structured-block

Valid clauses for this construct are untied, if, shared, private and firstprivate. The untied clause specifies that the task can be resumed by a different thread after a possible task switching point; when the expression in the if clause evaluates to false, the encountering thread suspends the current task region and begins execution of the new task immediately. The last three clauses are used for setting the data-sharing attributes of variables in the task body, and have the following syntax:

shared( variable-list )
private( variable-list )
firstprivate( variable-list )

where variable-list is a list of identifiers. Naming a variable inside a data-sharing clause explicitly sets the data-sharing attribute for the variable in the task construct. References in the task construct to variables for which the data-sharing attribute is private or firstprivate do not refer to the original variable but to a private storage of the task. Variables annotated with firstprivate, in addition, will have such storage initialized with the value of the original variable when the program execution reaches the task construct. References to variables for which the data-sharing attribute is shared refer to the original variable.

StarSs extends the task mechanism in OpenMP 3.0 to allow the specification of dependencies between tasks and to map the execution of certain tasks to a type of hardware accelerator (a device). StarSs considers each accelerator (e.g., an SPE, a GPU, an FPGA) as a single execution unit, which can efficiently execute specialized pieces of code. The runtime implementation of the model isolates the user from all the complexities related to task scheduling and offloading. The StarSs extensions are orthogonal to other possible extensions to generate efficient code by a compiler (e.g., vectorization width, number of threads running on accelerators, and code transformations). In the next sections, we explore how these extensions could be mapped to the OpenMP language.
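For reference, before introducing the extensions, the following minimal sketch (not taken from the paper; the function names are illustrative) shows how the OpenMP 3.0 constructs summarized above are typically combined: tasks are created inside a single region of a parallel team, data-sharing clauses control what each task sees, and a taskwait ensures completion.

void process(int *data, int i);        /* hypothetical per-chunk kernel */

void run(int *data, int nchunks)
{
    #pragma omp parallel
    #pragma omp single
    {
        for (int i = 0; i < nchunks; i++) {
            /* data is shared among the tasks; i gets a private copy
               initialized when the task is created (firstprivate) */
            #pragma omp task shared(data) firstprivate(i)
            process(data, i);
        }
        /* wait for all tasks created above to finish */
        #pragma omp taskwait
    }
}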

2.1 Taskifying Functions and Expressing Dependencies

StarSs can specify that a function should be executed as a task. To allow this, we have extended the OpenMP task construct to annotate functions in addition to structured blocks:

#pragma omp task [clause-list]
{ function-declaration | function-definition | structured-block }

Whenever the program calls a function annotated in this way, the runtime will create an explicit task. Although this seems to be a simple and naive extension, it associates an implicit name with the task that will be used later (see Sect. 2.3). We have also extended this construct with the StarSs clauses input, output and inout. This information is used to derive dependencies among tasks at runtime. The syntax of these clauses is:

input( data-reference-list )
output( data-reference-list )
inout( data-reference-list )

Dependencies are expressed by means of data-reference-lists, which are a superset of a variable-list. A data-reference in such a list can contain a variable identifier, but also references to subobjects. References to subobjects include array element references (e.g., a[4]), array sections (a[3:6]), field references (a.b), and elaborated shaping expressions ([10][20] p). For simplicity, details on the syntax used to define subobjects will be introduced in the following examples as well as in Sect. 2.4. Implicit tasks created in parallel regions are assumed to be totally independent; it is not allowed to use input, output and inout in a parallel construct.

Figure 1 illustrates the use of the extended task construct to parallelize a sequential code that computes the matrix multiplication C = C + A*B. In this particular code, the programmer defines each element of matrices A, B and C as a pointer to a block of BS x BS floats, which are allocated from inside the main function. Each task corresponds to an instantiation of function gemm, and the programmer uses the inout clause to express the data dependence that exists among tasks computing the same block of C (several tasks will overwrite the same block). In this simple example, since the allocation and initialization of the matrices is done sequentially, there is no need to annotate the blocks of matrices A and B with the input clause, as they are not involved in any data dependence.

Fig. 1 Matrix multiplication example annotated with our proposed extensions to OpenMP
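Since Fig. 1 itself is not reproduced in this copy, the following sketch (a reconstruction under assumptions: block size BS, row-major blocks of BS x BS floats, and the [BS][BS] shaping syntax described in Sect. 2.4) illustrates the annotation style described above. It requires a compiler implementing the proposed extensions; a standard OpenMP 3.0 compiler will not accept the input/output/inout clauses.

#define BS 64                          /* assumed block size */

/* Every call to gemm becomes a task. Annotating C as inout is what chains
   the tasks that update the same block of C; A and B carry no dependences
   here because they are initialized sequentially before any task runs. */
#pragma omp task inout([BS][BS]C)
void gemm(float *A, float *B, float *C)
{
    for (int i = 0; i < BS; i++)
        for (int j = 0; j < BS; j++)
            for (int k = 0; k < BS; k++)
                C[i*BS + j] += A[i*BS + k] * B[k*BS + j];
}

/* C = C + A*B on NB x NB matrices of pointers to BS x BS blocks */
void matmul(int NB, float *A[NB][NB], float *B[NB][NB], float *C[NB][NB])
{
    for (int i = 0; i < NB; i++)
        for (int j = 0; j < NB; j++)
            for (int k = 0; k < NB; k++)
                gemm(A[i][k], B[k][j], C[i][j]);   /* task creation point */
    #pragma omp taskwait
}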

In order to show an example with a richer set of dependencies, we use the LU factorization of a blocked sparse matrix A consisting of NB x NB blocks, where some of the blocks are void (i.e., all their entries are zero) while some others are fully populated with nonzero entries (they are dense). This is captured in the data structure that holds the matrix, where we assume that storage is only allocated for the dense blocks. Figure 2 shows the sequential Sparse_LU function annotated with the extended task construct, and Fig. 3 illustrates the dependencies that are encountered during the execution of this code on a particular blocked sparse matrix. Using the task construct, the programmer identifies four types of tasks, which correspond to the invocation of the kernels lu_getrf, lu_trsm_right, lu_trsm_left, and lu_gemm. In this case, the kernel call is annotated and the programmer needs to indicate the coordinates of the blocks involved in the operation (e.g., A[k][k] for lu_getrf) and their dimensions ([0:BS-1][0:BS-1] in all cases). For kernel lu_gemm, for example, the programmer also specifies that the first and second arguments (A and B) are input parameters (they are only read during the execution of the kernel) while the third argument (C) is inout (it is read and written during the execution of the kernel). Note that all references in the code correspond to blocks of the same matrix, yielding an elaborate dependence graph for this example (see Fig. 3). Note also that the dependencies depend on the input data (the sparsity pattern of the input matrix) and that new blocks can be dynamically generated (line 33 in Fig. 2). The annotations are placed on the original sequential version, with no transformations applied to expose the inherent parallelism available. The runtime employs the information implicit in the graph, transparently to the user/programmer, to extract task parallelism while satisfying the dependencies among tasks.
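Fig. 2 is also not reproduced here. The sketch below conveys the flavour of the annotated Sparse_LU code using the kernel names given in the text; it is not the paper's figure. In particular, the paper annotates the kernel calls with explicit block coordinates and dimensions, whereas this sketch places equivalent clauses on the kernel declarations, and which triangular solve updates row blocks versus column blocks is an assumption.

#include <stdlib.h>

#define BS 64                          /* assumed block size */

#pragma omp task inout([BS][BS]A)
void lu_getrf(float *A);                  /* factorize a diagonal block */

#pragma omp task input([BS][BS]D) inout([BS][BS]A)
void lu_trsm_right(float *D, float *A);   /* update a row block (assumed) */

#pragma omp task input([BS][BS]D) inout([BS][BS]A)
void lu_trsm_left(float *D, float *A);    /* update a column block (assumed) */

#pragma omp task input([BS][BS]A, [BS][BS]B) inout([BS][BS]C)
void lu_gemm(float *A, float *B, float *C);

void sparse_lu(int NB, float *A[NB][NB])
{
    for (int k = 0; k < NB; k++) {
        lu_getrf(A[k][k]);
        for (int j = k + 1; j < NB; j++)
            if (A[k][j]) lu_trsm_right(A[k][k], A[k][j]);
        for (int i = k + 1; i < NB; i++) {
            if (!A[i][k]) continue;
            lu_trsm_left(A[k][k], A[i][k]);
            for (int j = k + 1; j < NB; j++)
                if (A[k][j]) {
                    if (!A[i][j])              /* fill-in: allocate a new block */
                        A[i][j] = calloc(BS * BS, sizeof(float));
                    lu_gemm(A[i][k], A[k][j], A[i][j]);
                }
        }
    }
    #pragma omp taskwait
}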

Fig. 2 Sparse_LU example annotated with our proposed extensions to OpenMP

Fig. 3 Footprint of a 5 x 5 blocked sparse matrix (left) and the dependency graph for its sparse LU factorization (right). In the graph, square, triangle, diamond and circle shapes correspond to tasks lu_getrf, lu_trsm_right, lu_trsm_left and lu_gemm, respectively

2.2 Specifying Target Devices: Heterogeneous Cholesky

To target heterogeneous systems composed of general-purpose processors and hardware accelerators, we add a StarSs construct to OpenMP that may precede an existing task pragma:

#pragma omp target device(device-name-list) [clause-list]

The target construct specifies that the execution of the task could be off-loaded to any of the (types of) devices specified in device-name-list (and, as such, its code must be handled by the proper compiler backend). If the task is not preceded by a target directive, then the default device-name is used, which is smp and corresponds to a homogeneous shared-memory multicore architecture. Other device-names are vendor specific. We will use three examples in this paper to specify the accelerator: spe for a single SPE of the Cell B.E., cuda for a whole GPU and fpga for a whole FPGA. Two additional clauses can be used with the target pragma:

copy_in( data-reference-list )
copy_out( data-reference-list )

These two clauses, which are ignored for the smp device, specify data movement for shared variables used inside the task. The copy_in clause specifies those variables that must be moved from host memory to device memory. The copy_out clause specifies those variables that need to be moved back from device memory to host memory.

Figure 4 shows code that computes the Cholesky factorization of a dense matrix A consisting of NB x NB blocks of dimension BS x BS each. The operation is decomposed into four types of tasks: Cholesky factorization of a diagonal block (chol_potrf), triangular solve involving the subdiagonal blocks (chol_trsm), symmetric update of the blocks on the diagonal (chol_syrk), and update of the remaining blocks (chol_gemm). The target construct is used here to specify that all these tasks, except for the factorization of the diagonal block, should be computed on a cuda accelerator (i.e., a GPU). Other vendor-specific clauses in the target construct for each particular device-name are possible.

Fig. 4 Cholesky example annotated with our proposed extensions to OpenMP
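As with the other figures, Fig. 4 is not reproduced in this copy; the sketch below shows how the target construct and its copy_in/copy_out clauses could be combined with the task annotations for the Cholesky kernels named above. The kernel bodies, the block layout and the exact clause usage are assumptions, and the cuda versions of the kernels are left to the accelerator-specific compiler.

#define BS 64                                 /* assumed block size */

/* The diagonal factorization stays on the default smp device. */
#pragma omp task inout([BS][BS]A)
void chol_potrf(float *A);

/* The remaining kernels may be off-loaded to a GPU; copy_in/copy_out tell
   the runtime which blocks to move between host and device memory. */
#pragma omp target device(cuda) copy_in([BS][BS]T, [BS][BS]B) copy_out([BS][BS]B)
#pragma omp task input([BS][BS]T) inout([BS][BS]B)
void chol_trsm(float *T, float *B);

#pragma omp target device(cuda) copy_in([BS][BS]A, [BS][BS]C) copy_out([BS][BS]C)
#pragma omp task input([BS][BS]A) inout([BS][BS]C)
void chol_syrk(float *A, float *C);

#pragma omp target device(cuda) copy_in([BS][BS]A, [BS][BS]B, [BS][BS]C) copy_out([BS][BS]C)
#pragma omp task input([BS][BS]A, [BS][BS]B) inout([BS][BS]C)
void chol_gemm(float *A, float *B, float *C);

/* Right-looking blocked Cholesky on the lower triangle of A. */
void cholesky(int NB, float *A[NB][NB])
{
    for (int k = 0; k < NB; k++) {
        chol_potrf(A[k][k]);
        for (int i = k + 1; i < NB; i++)
            chol_trsm(A[k][k], A[i][k]);
        for (int i = k + 1; i < NB; i++) {
            for (int j = k + 1; j < i; j++)
                chol_gemm(A[i][k], A[j][k], A[i][j]);
            chol_syrk(A[i][k], A[i][i]);
        }
    }
    #pragma omp taskwait
}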

Some restrictions may apply to tasks that target a specific device (for example, they may not contain any other OpenMP directives or perform any input/output on some devices). In addition, tasks offloaded to some specific devices should be tied, or they should execute on the same type of device if thread switching is allowed.

2.3 Specifying Alternative Implementations: Matrix Multiplication Revisited

The target construct also offers the implements clause to specify alternative implementations, tailored to specific accelerator devices, for a taskified function. The syntax of the clause is:

implements( function-name )

Using this clause, in the example in Fig. 5, the programmer specifies three possible options to execute function gemm. The first one uses the original definition of function gemm for the default target architecture. The user also specifies two alternative implementations: gemm_cuda for an NVIDIA GPU, and gemm_spe for the Cell B.E. For all the devices, the runtime is in charge of moving data before and after the execution of the task. If the original implementation is appropriate for one of the accelerator types, then the programmer should precede the definition of the task with the specification of the target device (as in line 1 of Fig. 5). In this case, the compiler would generate two versions of the same function, one going through the native optimizer for the default device, and another using the accelerator-specific compiler.

Fig. 5 Matrix multiplication example annotated with our proposed extensions to OpenMP
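The sketch below illustrates the implements clause in the spirit of Fig. 5 (again a reconstruction, not the figure itself). The generic gemm task is compiled for the default smp device; gemm_cuda and gemm_spe are hand-written alternatives that the runtime may substitute when it schedules the task on a GPU or an SPE, with data movement handled automatically.

#define BS 64                          /* assumed block size */

/* Original task; this version runs on the default smp device. */
#pragma omp task inout([BS][BS]C)
void gemm(float *A, float *B, float *C)
{
    for (int i = 0; i < BS; i++)
        for (int j = 0; j < BS; j++)
            for (int k = 0; k < BS; k++)
                C[i*BS + j] += A[i*BS + k] * B[k*BS + j];
}

/* Device-specific alternatives registered through implements(gemm);
   their bodies (a CUDA kernel launch, an SPE kernel) are omitted. */
#pragma omp target device(cuda) implements(gemm)
void gemm_cuda(float *A, float *B, float *C);

#pragma omp target device(spe) implements(gemm)
void gemm_spe(float *A, float *B, float *C);

Call sites remain plain calls to gemm(); the choice of implementation is deferred to the runtime scheduler.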

2.4 Specifying Array Section Dependences

Through the examples so far we have seen dependence specifications of scalar objects or full arrays, but our syntax can also specify sections of arrays. Since C/C++ does not have any way to express ranges of an array, we have borrowed the array-section syntax from Fortran 90. An array section is expressed as a[first:last], meaning all elements of the array a from the first to the last element, inclusive. Both first and last are expressions evaluated at execution time.

Figure 6 shows a simple example of array sections where task A fills the bottom half of the array a (i.e., elements 0 to N/2-1) and task B fills the remaining elements (i.e., elements N/2 to N-1). Task C waits until both tasks are finished before executing. For syntactic economy, a[:last] is the same as a[0:last]. For arrays where the upper bound is known, a[first:] and a[:] mean, respectively, a[first:n-1] and a[0:n-1]. Designating an array (i.e., a) in a data-reference list without an array section or an array subscript is equivalent to the whole array section (i.e., a[:]). Array sections can also be specified for multidimensional arrays by specifying one section for each dimension (e.g., a[1:2][3:4]).

While technically not arrays, array sections can also be applied to pointers: p[first:last] refers to the elements (p + first) to (p + last). Pointers to arrays can use multidimensional sections, but because pointers lack dimensional information, multidimensional sections are not allowed for pointer-to-pointer types without a shaping expression. In order to use multidimensional sections over pointers, the structural information of the array dimensions needs to be restored. A shaping expression serves that purpose. Shaping expressions are a sequence of dimensions, enclosed in square brackets, followed by a data reference that should refer to a pointer type: [N]p. For example, in Fig. 7, the input clause on line 3 creates a dependence against the pointer value and not the pointed data, but using a shaping expression as in line 5 the dependence is against the pointed data. As shown in line 7, shaping expressions enable multidimensional sections over pointers. Array parameters are implicitly converted to pointer types in C/C++, where the outermost dimension is lost; thus, a shaping expression is required to define a dependence on the whole array. This is why in line 10 of Fig. 7 the input clause creates a dependence with the pointer, but in line 11 the dependence is created against the matrix stored through the pointer.

Fig. 6 Simple example of array sections

Fig. 7 Shaping expression examples
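A compact sketch combining the ideas of Figs. 6 and 7 (the helper functions are hypothetical, and the figures' exact line numbering is not reproduced): array sections let the runtime see that two tasks write disjoint halves of an array, while shaping expressions distinguish a dependence on a pointer value from a dependence on the data it points to.

#define N 1024

float a[N];                        /* array with known bounds */

void init_lower(float *v);         /* hypothetical helpers */
void init_upper(float *v);
void consume(const float *v);
void use_pointer(float *p);
void use_data(float *p);
void use_matrix(float *m);

void sections_example(void)
{
    #pragma omp task output(a[0:N/2-1])    /* task A: bottom half */
    init_lower(a);

    #pragma omp task output(a[N/2:N-1])    /* task B: top half    */
    init_upper(a);

    #pragma omp task input(a[:])           /* task C: whole array, waits for A and B */
    consume(a);

    #pragma omp taskwait
}

void shaping_example(float *p, float *m)   /* m points to N*N floats */
{
    #pragma omp task input(p)              /* dependence on the pointer value */
    use_pointer(p);

    #pragma omp task input([N]p)           /* dependence on the pointed data  */
    use_data(p);

    #pragma omp task input([N][N]m)        /* 2-D view of a flat pointer; sections
                                              over the shaped pointer are also possible */
    use_matrix(m);

    #pragma omp taskwait
}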

2.5 Additional Clause for Taskwait

We have also extended the OpenMP taskwait construct with an on clause from StarSs, as follows:

#pragma omp taskwait on( data-reference-list )

in order to wait for the termination of those tasks whose output or inout clauses match data-reference-list. For example, in the code shown in Fig. 8, the programmer needs to insert the taskwait pragma in order to ensure that the next statement reads the appropriate value of variable x, which is generated by task A. However, tasks B and C can run in parallel with the code after the taskwait pragma.

Fig. 8 Example using the extended taskwait pragma
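A small sketch in the spirit of Fig. 8 (the compute_* functions are hypothetical): the on clause waits only for the producer of x, so the tasks producing y and z may overlap with the code that follows.

#include <stdio.h>

int x, y, z;

int compute_x(void);               /* hypothetical producers */
int compute_y(void);
int compute_z(void);

void taskwait_on_example(void)
{
    #pragma omp task output(x)     /* task A */
    x = compute_x();

    #pragma omp task output(y)     /* task B */
    y = compute_y();

    #pragma omp task output(z)     /* task C */
    z = compute_z();

    #pragma omp taskwait on(x)     /* wait only for task A */
    printf("x = %d\n", x);         /* B and C may still be running here */

    #pragma omp taskwait           /* full synchronization before returning */
}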

3 Extensions to the OpenMP Tasking Execution Model

The runtime supporting the execution of the StarSs extensions dynamically creates explicit tasks, and the memory region specifiers in each data-reference-list are used to build a task dependence graph as follows:

- Data references specified in input or inout clauses are checked against the data references specified in the output or inout clauses of all tasks, in the scope of the same parent task, that are in execution or pending execution. If there is a match, a true dependence is added between both tasks.
- Data references specified in output or inout clauses are checked against the data references specified in the input, output or inout clauses of all tasks, in the scope of the same parent task, that are in execution or pending execution. If there is a match, a false dependence appears. This dependence can be eliminated by dynamically renaming the memory region specified in the data reference. Renaming is an optional feature of the runtime which can be activated selectively to increase the potential parallelism in the task graph.
- A variable in a shared data clause, but not in an input, output or inout clause, indicates that the variable is accessed inside the task but is not affected by any data dependence in the current scope of execution (or is protected by some other mechanism).

When a task is ready for execution (i.e., it has no unsatisfied dependencies on previously generated tasks):

- The runtime can choose among the different available targets to execute the task. This decision is implementation-dependent, but it will ideally be tailored to resource availability. If no resource is available, the runtime could stall that task until one becomes available or launch the execution of the original function on the default smp device.
- The runtime system must copy the variables in the copy_in list from the host memory to the device memory. Once the task finishes execution, the runtime must copy back the variables in the copy_out list, if necessary.

Many optimizations are possible in the way local stores are handled, based on the potentially huge level of lookahead that a task graph represents and on the information available about the accessed data. It is, for example, possible to minimize data transfers by caching data already transferred and scheduling tasks to the accelerators where local copies of their input data are available. The execution of a taskwait forces the write-back to memory, for the references in its data-reference-list, of any data renaming that has been dynamically introduced by the runtime system to eliminate false dependences.

3.1 Proof-of-Concept Implementation for Multi-Core

The SMP Superscalar (SMPSs) runtime targets homogeneous multicore and shared-memory (SMP) machines. For this specific implementation the runtime exploits the fact that all cores have coherent load/store access to all data structures, and therefore no specific data movement between the main program and the cores executing the tasks is required. Although a main thread is responsible for executing the main program, for adding tasks to the data dependence graph, and for synchronizing all threads, the runtime implements distributed scheduling. All threads execute tasks, including the main thread when it would otherwise be idle. A general ready-task list is accessible by all threads, while private ready lists store the ready tasks that must be executed by each core. To favor the exploitation of memory locality, each thread first processes the tasks inserted in its own ready list, where it also inserts the new ready tasks released by the completion of those executed on the same core. To favor load balancing, a thread steals tasks from the main ready list and from the lists of other cores when its own is empty.
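The scheduling policy just described can be summarized with the following sketch (this is not the SMPSs source; list sizes, locking and data structures are assumptions): each worker drains its own ready list first, then the global list, and finally steals from other workers.

#include <pthread.h>
#include <stddef.h>

typedef struct task task_t;                  /* opaque task descriptor (assumed) */

#define MAX_READY   1024
#define MAX_WORKERS 64

typedef struct {
    task_t         *items[MAX_READY];
    int             count;
    pthread_mutex_t lock;
} ready_list_t;

static ready_list_t global_ready;
static ready_list_t local_ready[MAX_WORKERS];
static int          nworkers;

void scheduler_init(int n)
{
    nworkers = n;
    pthread_mutex_init(&global_ready.lock, NULL);
    for (int i = 0; i < n; i++)
        pthread_mutex_init(&local_ready[i].lock, NULL);
}

static task_t *pop(ready_list_t *l)
{
    task_t *t = NULL;
    pthread_mutex_lock(&l->lock);
    if (l->count > 0)
        t = l->items[--l->count];
    pthread_mutex_unlock(&l->lock);
    return t;
}

/* Locality first, then the shared list, then stealing. */
task_t *next_task(int worker)
{
    task_t *t;
    if ((t = pop(&local_ready[worker])) != NULL) return t;
    if ((t = pop(&global_ready)) != NULL)        return t;
    for (int v = 0; v < nworkers; v++)
        if (v != worker && (t = pop(&local_ready[v])) != NULL)
            return t;                            /* stolen task */
    return NULL;                                 /* nothing ready */
}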

3.2 Proof-of-Concept Implementation for the Cell B.E.

The Cell Superscalar (CellSs) runtime implementation targets execution on Cell B.E. based platforms. The challenges with this platform are twofold: the heterogeneity of the chip, with one general-purpose (and slower) multi-threaded Power-based processor element (PPE) and eight synergistic processor elements (SPEs); and the memory organization, with a PPE main memory that is not coherent with the local memories of the SPEs. Data transfers from the main memory to the small (only 256 KB) local memories of the SPEs must be explicitly programmed with DMA.

The CellSs runtime is organized as two threads that run on the PPE (the main thread and the helper thread) and up to sixteen threads that run on the SPEs.(1) The user program starts normal execution in the main thread and, whenever an annotated task is called, a new node is added to the task graph with its corresponding dependences. The helper thread is responsible for task scheduling and synchronization with the SPEs. Each SPE thread waits for tasks to be scheduled on its core. To reduce communication overhead and the need for explicit communications, tasks are scheduled in sets to be executed on the same SPE. Within a task set, double buffering (data transfers for the next task are performed concurrently with the execution of the current task) and other optimizations (e.g., the reduction of write transfers when a parameter is written more than once in a task set) can be applied. Extracting data-level parallelism for each SPE (simdization) is left in the hands of the native target compiler and/or the programmer (using intrinsics, for example).

(1) Although Cell B.E. chips have up to eight SPEs (and only six are available on the chips that equip the PlayStation 3), the blades usually come with two of those chips and the system enables access to all of them from the Power processors.

3.3 Proof-of-Concept Implementation for NVIDIA GPUs

The GPU Superscalar (GPUSs) implementation targets the parallelization of applications on platforms consisting of a general-purpose (possibly multi-core) processor (the host) connected to several hardware accelerators (the devices). In our prototype implementation, these accelerators are programmable NVIDIA GPUs, which communicate with the host via a PCIExpress bus. Each GPU can only access its own local memory space; direct communication between GPUs is not possible, so data transfers between them must be performed through the host memory. Our approach considers each accelerator as a single execution unit, which can efficiently execute specialized pieces of code (in our case, CUDA kernels defined by the programmer as tasks). GPUSs is not aware of the internal architecture of the accelerator, and it only exploits task parallelism by mapping/scheduling the execution of tasks to the hardware accelerators in the system. As in CellSs, extracting data-level parallelism inside a single GPU is left in the hands of the programmer and the native device-specific compiler.

The architecture of the GPUSs runtime is similar to that of CellSs. However, there are some important differences between the two, derived from the architectural features of each system. In particular, data transfers between memory spaces through the PCIExpress bus are a major bottleneck in this type of multi-accelerator system. Therefore, the number of data movements associated with the execution of a given task must be reduced as much as possible to improve performance. To do so, GPUSs views the local store of each accelerator as a cache memory that keeps data blocks recently used by the GPU.

The replacement policy (LRU in our current implementation) and the number and size of the blocks in the cache can be easily tuned in the runtime. Inconsistencies between data blocks stored in the caches of the accelerators and the blocks in the host memory are allowed by using a write-back memory coherence policy; thus, data blocks written by a GPU are only updated in the host memory when another GPU has to operate on them. Coherence among the local stores of the GPUs is maintained with a write-invalidate policy. The system handles the existence of multiple memory spaces by keeping a memory map of the cache of each accelerator. The translation of addresses between different memory spaces is transparent to the user. Additionally, the information stored in the map is used to reduce data transfers by carefully mapping tasks to the most appropriate execution resource. The basic architecture of the runtime and many of the techniques implemented in GPUSs for NVIDIA multi-accelerator systems can be easily ported to other (homogeneous or heterogeneous) multi-accelerator platforms with similar architectures.

3.4 Proof-of-Concept Implementation for SGI RASC FPGA

For FPGA-based systems, offloading a task means sending the bitstream corresponding to the task, configuring the FPGA device, and sending/receiving data. In these systems, the time required to offload a bitstream to the device (bitstream file read and device configuration) can be significantly high. For example, it takes approximately one second to load a 4 MB bitstream in the SGI RASC blade. Once the bitstream is loaded in memory, reusing it takes just 4 ms. In order to hide this high bitstream loading time, the FPGASs prototype implementation includes a fully associative bitstream cache that keeps information about the bitstreams currently loaded in the FPGA devices. When a task pragma is found, the runtime checks whether the bitstream that implements the task is already configured. A hit in the bitstream cache produces the effective offloading of the task execution. If the runtime detects a miss in the bitstream cache, it applies a least frequently used (LFU) replacement policy. During the miss, and in order to hide the FPGA configuration time, the runtime launches the execution of the task on the host processor. Once the bitstream is configured, the runtime will detect a hit in the bitstream cache and offload the execution of future instances of that task to the FPGA device. Since the data transfer overhead between the host and the FPGA device is usually an important issue, the runtime system applies data packing and unpacking according to the different memory associations detected by the compiler.
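The bitstream-cache policy described for the FPGASs prototype can be sketched as follows (a simplification under assumptions: the cache size, the runtime hooks and the sequential handling of a miss are all illustrative, and in the real runtime the host execution and the FPGA configuration would overlap):

#include <stdbool.h>

#define CACHE_SLOTS 8                         /* assumed number of resident bitstreams */

typedef struct {
    int           task_id;                    /* task type implemented by the bitstream */
    unsigned long use_count;                  /* statistics for LFU replacement         */
    bool          valid;
} bitstream_slot_t;

static bitstream_slot_t cache[CACHE_SLOTS];

/* Hypothetical hooks into the runtime and the device library. */
void load_bitstream(int task_id, int slot);   /* slow: file read + FPGA configuration */
void run_task_on_fpga(int task_id, void *args);
void run_task_on_host(int task_id, void *args);  /* original function on the smp device */

void offload_or_fallback(int task_id, void *args)
{
    int victim = 0;
    for (int s = 0; s < CACHE_SLOTS; s++) {
        if (cache[s].valid && cache[s].task_id == task_id) {
            cache[s].use_count++;             /* hit: offload to the FPGA */
            run_task_on_fpga(task_id, args);
            return;
        }
        if (!cache[s].valid ||
            (cache[victim].valid &&
             cache[s].use_count < cache[victim].use_count))
            victim = s;                       /* track the least frequently used slot */
    }
    /* Miss: execute this instance on the host to hide the configuration
       time, and configure the bitstream so future instances hit. */
    run_task_on_host(task_id, args);
    load_bitstream(task_id, victim);
    cache[victim].task_id   = task_id;
    cache[victim].use_count = 1;
    cache[victim].valid     = true;
}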

3.5 Bringing it all Together: Challenges Ahead

The previous subsections documented implementations of the model for specific accelerators. These implementations (other than SMPSs) use the main processor mainly as a controlling device to spawn tasks to the accelerators, but do not execute tasks on it. With new architectures [4], where a fair amount of the processing capacity resides in the main processors, this is not a desirable approach. Furthermore, architectures with more than one type of accelerator are already appearing, and a runtime that can handle more than one at the same time will be required.

Supporting more than one accelerator and being able to run tasks on the main processor(s) is not that difficult. By defining clean interfaces that hide architectural details, this becomes more an engineering task than a research one. But then things become more interesting. Once a task can run on any of the processing elements (supposing the user has specified versions for all of them with our extensions), the runtime must decide which one to use. On one hand, the runtime should maximize the usage of all the processing elements of the system. A naive approach could schedule each new task on the fastest element that is not executing anything. But, of course, different tasks will be faster on different processing elements. Because there is usually some repetitiveness in the kind of tasks executed, we think that it will be possible for the runtime to learn, as the execution proceeds, where to send each task. On the other hand, we want to minimize the communication costs. One obvious thing to do is to schedule tasks that use the same data on the same processing element. This will not always be possible, in which case the runtime will need to take into account the cost of transferring the data from the processing element where they were last used. This is not an easy task, as the communication mechanism may differ from one processing element to another. These two factors (efficiency and reduced communication) will need to be balanced to obtain an optimal scheduling. An important factor in this decision is the communication-to-computation ratio, and it is unclear whether the runtime will be able to find the right choice on its own. In any case, further research needs to be conducted on a fully heterogeneous runtime/system to solve this tradeoff.

4 Experimental Results

In this section we illustrate the general validity and portability, and the promising performance results, of the StarSs model, and therefore of the proposed extensions to OpenMP, to exploit task-level parallelism in heterogeneous architectures. The Cholesky factorization in Fig. 4 will serve as a case study for SMPSs, CellSs and GPUSs, but we emphasize again that the scope of StarSs is not limited to the parallelization of (dense) linear algebra codes. This operation is appealing in that it is a basic building block for a wide variety of scientific and engineering applications (the Cholesky factorization is the most efficient method to solve certain classes of linear systems of equations). Besides, this factorization shares with many other linear algebra operations a well-known and regular algorithmic structure, and it has traditionally been considered a starting point for HPC community efforts. In all experiments, when reporting the rate of computation we consider the cost of the Cholesky factorization of a square matrix of order n to be the standard n^3/3 floating-point arithmetic operations (flops for short).

For SMPSs, the annotations with the target construct in the code in Fig. 4 are ignored, as all operations are computed on the cores of the shared-memory multiprocessor. The experimental analysis has been performed on an SGI Altix multiprocessor, consisting of 16 nodes equipped with a dual-core CPU at 1.6 GHz each (the peak performance of the system is GFLOPS). The memory system presents a cc-NUMA organization, with a local RAM shared by the CPUs in each node, and an SGI NUMAlink interconnection network. The codes employ double-precision arithmetic and were linked with the BLAS in Intel's Math Kernel Library (MKL) 9.1.

Fig. 9 Performance (left) and scalability (right) of the SMPSs Cholesky factorization routine: GFLOPS on 32 cores as a function of the matrix size (left), and GFLOPS for an 8000 x 8000 matrix as a function of the number of cores (right), for SMPSs and MKL

This package provides a highly tuned implementation of linear algebra basic building blocks, like chol_trsm_right, chol_gemm, and chol_syrk, on Intel processors. Figure 9 compares the GFLOPS rates attained by the Cholesky factorization routine parallelized with SMPSs and the (traditional) multi-threaded implementation of this routine in MKL. We note the superior performance of the SMPSs code, due to the higher level of parallelism and scalability exposed by the dynamic scheduling of tasks in SMPSs.

The experiments with CellSs were run on a QS22 IBM blade server, with 2 PowerXCell processors (the high-performance double-precision floating-point version of the Cell B.E. processor) at 3.2 GHz, and 12 GB of memory. The results were obtained with CellSs version 2.2 and the IBM SDK version 3.1. Figure 10 presents the results obtained for the Cholesky factorization on the Cell-based platform. The figure on the left shows the absolute performance results when executing with 8 SPUs and varying the matrix size. The results are compared against a hand-coded version, where the graph generation and the scheduling of the tasks are performed statically [5]. The loss in performance against this hand-coded version is 19% for large matrix sizes. We consider this a more than reasonable result, taking into account that the original hand-coded version has 302 lines of code while the CellSs version has 32.(2) The figure on the right shows the scalability of CellSs from 1 to 8 SPUs, using the elapsed time with 1 SPU as the base case.

(2) Only accounting for code lines, without comments and includes. The source code of the tiles that is the same for both examples is not accounted for.

Our next experiment employs GPUSs, the prototype extension of StarSs for platforms with multiple GPUs. The target system is a server with two Intel Xeon QuadCore E5405 (2.0 GHz) processors and 8 GBytes of shared DDR2 RAM memory, connected to an NVIDIA Tesla s870 computing system with four NVIDIA G80 GPUs and 6 GBytes of DDR3 memory (1.5 GBytes per GPU). The Intel 5400 chipset features two PCIExpress Gen2 interfaces connected to the Tesla, which deliver a peak bandwidth of 48 Gbits/second on each interface. We used NVIDIA CUBLAS (version 2.0) built on top of the CUDA API (version 2.0) together with the NVIDIA driver (171.05). Single precision was employed in all experiments. The cost of all data transfers between RAM and GPU memories is included in the timings. Figure 11 illustrates the parallel performance and scalability of the Cholesky factorization codes with our prototype implementation of GPUSs, compared with the original CUBLAS.

Fig. 10 Performance (left) and scalability (right) of the Cholesky factorization routine parallelized with CellSs: GFLOPS on 8 SPUs as a function of the matrix size, for CellSs and a hand-coded version with static scheduling (left), and speed-up for a 4096 x 4096 matrix as a function of the number of SPUs (right)

Fig. 11 Performance (left) and scalability (right) of the Cholesky factorization routine parallelized with GPUSs: GFLOPS on 4 GPUs as a function of the matrix size, for GPUSs and CUBLAS (left), and speed-up with 1, 2, 3 and 4 GPUs as a function of the matrix size (right)

5 Related Work

OpenMP [2] grew out of the need to standardize the explosion of parallel programming languages and tools of the 90s. It was initially structured around parallel loops and was meant to handle dense numerical applications. The simplicity of its original interface, the use of a shared-memory model, and the fact that the parallelism of a program is expressed with annotations loosely coupled to the code have all helped OpenMP become well accepted. While recent additions to OpenMP 3.0 [6] accommodate task parallelism, making it more suitable for irregular codes, OpenMP is not ready for the challenges posed by the new generation of multi-core heterogeneous architectures.

The Cell B.E. [7] is likely one of the most challenging heterogeneous architectures to program. IBM developed an OpenMP prototype compiler that generates parallel programs under the master-slave programming model. Data transfers between the master (PPE) and the slaves (SPEs) are transparently introduced employing a software cache, although the compiler can try to optimize for very regular access patterns.

Other programming solutions for the Cell B.E., like Sequoia, MPI microtasks, and our own CellSs, are more promising in that they target task parallelism, employing higher-level information to perform more complete optimizations.

GPUs have traditionally been programmed using domain-specific graphics libraries such as OpenGL or DirectX. NVIDIA was one of the pioneering graphics companies to realize the potential of general-purpose graphics processors, and the benefits which could be gained by offering a general-purpose application programming interface (API). The result was CUDA [8], a unified architecture design featuring a programmable graphics pipeline, and an API to develop parallel programs that exploit data parallelism on this architecture. Unfortunately, in order to develop efficient codes, CUDA programmers (as well as those of Brook+ [9], the data-parallel language for AMD/ATI GPUs) still need to be deeply aware of the underlying architecture. A promising approach that may solve this problem consists in automatically transforming existing OpenMP codes into CUDA.

Hardware description languages (e.g., Verilog or VHDL) are employed to develop FPGA-accelerated computational kernels. In order to solve the programmability problem for these devices, extensions to the C language have appeared. Usually, there are two strategies to compile these extensions. In the first strategy, the section of code to be offloaded to the FPGA is translated from C to VHDL. The second strategy is to map a soft processor (e.g., Xilinx MicroBlaze) onto the FPGA, and translate the source code to be executed into code that this soft processor understands. In both cases, a host/device communication library, such as RASClib for the SGI RASC architecture, is necessary to offload the execution to one of the available FPGAs, including data movement.

The PGI [10] and HMPP [11] programming models are two other approaches trying to tackle the accelerator problem with high-level directives. PGI uses compiler technology to offload the execution of loops to the accelerators. HMPP also annotates functions as tasks to be offloaded to the accelerators. We think that StarSs has higher potential in that it shifts part of the intelligence that HMPP and PGI delegate to the compiler to the StarSs runtime system. Although these alternatives do support a fair amount of asynchronous computations expressed as futures or continuations, the level of lookahead they support is limited in practice. In these approaches, synchronization requests (waiting for a given future or selecting among some list of them when the result is needed) have to be explicitly inserted in the main control flow of the program. Besides the additional complexity of the code, this approach implies that certain scheduling decisions are made statically and hardwired in the application code. The approach followed in StarSs exploits much higher levels of lookahead (tens of thousands of tasks) without requiring the programmer to schedule the synchronization operations explicitly, giving the runtime much higher flexibility to respond to the foreseeable variability of application characteristics and resource availability.

5.1 Tending a Bridge to OpenCL

A recent attempt to unify the programming models for general-purpose multi-core architectures and the different types of hardware accelerators (Cell B.E., GPUs, FPGAs, DSPs, etc.) is OpenCL [12].

The participation of silicon vendors (e.g., Intel, IBM, NVIDIA, and AMD) in the definition of this open standard ensures portability, low-level access to the hardware, and supposedly high performance. We believe, however, that OpenCL still exposes much of the low-level details, making it cumbersome to use by non-experts. Finding the appropriate interoperability between the StarSs extensions to OpenMP and OpenCL can be a viable solution. While StarSs targets the exploitation of task parallelism by mapping/scheduling the execution of tasks to the hardware accelerators in the system, OpenCL could be used to express, in a portable way, the data-level parallelism to be exploited in the accelerators by the native device-specific compiler.

6 Conclusions and Future Work

In this paper we propose a number of extensions to the OpenMP language, coming from the StarSs research programming model, that try to tackle the programmability of emerging heterogeneous architectures. The first group of these extensions aims to enable a more productive and effective parallelization, where the user specifies the data dependences among the different parallel tasks of the computation. The compiler and runtime are then responsible for extracting the parallelism of the application and performing the needed synchronizations. The second group specifies on which accelerators a task should run. This information also allows the runtime to generate the code to offload the tasks to the accelerators and, combined with the data dependences, to generate the code that takes care of the data movement. Furthermore, a user can specify optimized versions of the task code for a given accelerator to take advantage of already optimized operations (e.g., the BLAS libraries) while maintaining a single source code that is portable across multiple platforms.

We have presented the challenges of implementing this model on a number of architectures (e.g., multicores, the Cell B.E., GPUs and FPGAs). Our experimental results show that our current prototypes can obtain high performance compared to the amount of effort that the programmer needs to devote. While in some cases there is still a gap with respect to hand-tuned versions, we expect that further research will reduce it. In particular, our current research focuses on improving the scheduling of tasks to minimize the communication of data across the different processing elements of the system and to increase data locality. Other research directions try to increase the amount of available parallelism by removing false dependences (using renaming as outlined in Sect. 3, and establishing a trade-off between the memory used in data renaming and the parallelism exploited at runtime). Finally, we are in the process of integrating all proof-of-concept implementations into a single runtime and including the intelligence described in Sect. 3.5 to make use of different accelerators at the same time. Further scheduling research is needed to make this efficient.

Acknowledgments We would like to thank the reviewers of this paper for their helpful and constructive comments. The researchers at BSC-UPC were supported by the Spanish Ministry of Science and Innovation (contracts no. TIN and CSD ), the European Commission in the context of the SARC project (contract no. ) and the HiPEAC2 Network of Excellence (contract no. FP7/IST ), and the MareIncognito project under the BSC-IBM collaboration agreement.

The researchers at the Universidad Jaume I were supported by the Spanish Ministry of Science and Innovation/FEDER (contracts no. TIN C02-02 and TIN C04-01) and by the Fundación Caixa-Castellón/Bancaixa (contracts no. P1B and P1B ).

References

1. Seiler, L., Carmean, D., Sprangle, E., Forsyth, T., Abrash, M., Dubey, P., Junkins, S., Lake, A., Sugerman, J., Cavin, R., Espasa, R., Grochowski, E., Juan, T., Hanrahan, P.: Larrabee: a many-core x86 architecture for visual computing. ACM Trans. Graph. 27(3), 1-15 (2008)
2. OpenMP Architecture Review Board: OpenMP 3.0 Specification. May (2008)
3. Bellens, P., Perez, J.M., Badia, R.M., Labarta, J.: CellSs: a programming model for the Cell BE architecture. In: Proceedings of the ACM/IEEE SC 2006 Conference, November (2006)
4. Turner, J.A.: Roadrunner: Heterogeneous Petascale Computing for Predictive Simulation. Technical Report LANL-UR, Los Alamos National Lab, Las Vegas, NV (2007)
5. Kurzak, J., Buttari, A., Dongarra, J.: Solving systems of linear equations on the Cell processor using Cholesky factorization. IEEE Trans. Parallel Distrib. Syst. 19(9) (2008)
6. Ayguadé, E., Copty, N., Duran, A., Hoeflinger, J., Lin, Y., Massaioli, F., Teruel, X., Unnikrishnan, P., Zhang, G.: The design of OpenMP tasks. IEEE Trans. Parallel Distrib. Syst. 20(3) (2009)
7. Pham, D.C., Aipperspach, T., Boerstler, D., Bolliger, M., Chaudhry, R., Cox, D., Harvey, P., Harvey, P.M., Hofstee, H.P., Johns, C., Kahle, J., Kameyama, A., Keaty, J., Masubuchi, Y., Pham, M., Pille, J., Posluszny, S., Riley, M., Stasiak, D.L., Suzuoki, M., Takahashi, O., Warnock, J., Weitzel, S., Wendel, D., Yazawa, K.: Overview of the architecture, circuit design, and physical implementation of a first-generation Cell processor. IEEE J. Solid-State Circuits 41(1) (2006)
8. NVIDIA: NVIDIA CUDA Compute Unified Device Architecture-Programming Guide (2007)
9. Buck, I., Foley, T., Horn, D., Sugerman, J., Fatahalian, K., Houston, M., Hanrahan, P.: Brook for GPUs: stream computing on graphics hardware. In: SIGGRAPH '04: ACM SIGGRAPH 2004 Papers. ACM Press, New York (2004)
10. PGI: PGI Fortran and C Accelerator Compilers and Programming Model Technology Preview. The Portland Group (2008)
11. Dolbeau, R., Bihan, S., Bodin, F.: HMPP: a hybrid multi-core parallel programming environment. In: First Workshop on General Purpose Processing on Graphics Processing Units, October (2007)
12. Khronos OpenCL Working Group: The OpenCL Specification. Aaftab Munshi, Ed. (2009)


More information

OpenACC Standard. Credits 19/07/ OpenACC, Directives for Accelerators, Nvidia Slideware

OpenACC Standard. Credits 19/07/ OpenACC, Directives for Accelerators, Nvidia Slideware OpenACC Standard Directives for Accelerators Credits http://www.openacc.org/ o V1.0: November 2011 Specification OpenACC, Directives for Accelerators, Nvidia Slideware CAPS OpenACC Compiler, HMPP Workbench

More information

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI. CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance

More information

Shared Memory Parallel Programming. Shared Memory Systems Introduction to OpenMP

Shared Memory Parallel Programming. Shared Memory Systems Introduction to OpenMP Shared Memory Parallel Programming Shared Memory Systems Introduction to OpenMP Parallel Architectures Distributed Memory Machine (DMP) Shared Memory Machine (SMP) DMP Multicomputer Architecture SMP Multiprocessor

More information

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620 Introduction to Parallel and Distributed Computing Linh B. Ngo CPSC 3620 Overview: What is Parallel Computing To be run using multiple processors A problem is broken into discrete parts that can be solved

More information

OpenMP tasking model for Ada: safety and correctness

OpenMP tasking model for Ada: safety and correctness www.bsc.es www.cister.isep.ipp.pt OpenMP tasking model for Ada: safety and correctness Sara Royuela, Xavier Martorell, Eduardo Quiñones and Luis Miguel Pinho Vienna (Austria) June 12-16, 2017 Parallel

More information

Tutorial OmpSs: Overlapping communication and computation

Tutorial OmpSs: Overlapping communication and computation www.bsc.es Tutorial OmpSs: Overlapping communication and computation PATC course Parallel Programming Workshop Rosa M Badia, Xavier Martorell PATC 2013, 18 October 2013 Tutorial OmpSs Agenda 10:00 11:00

More information

Dealing with Asymmetry for Performance and Energy Efficiency

Dealing with Asymmetry for Performance and Energy Efficiency Dealing with Asymmetryfor Performance and Energy Efficiency Enrique S. QUINTANA-ORTÍ Motivation Moore s law is alive, but Dennard s scaling is over Motivation Welcome dark silicon and asymmetric architectures

More information

Task Superscalar: Using Processors as Functional Units

Task Superscalar: Using Processors as Functional Units Task Superscalar: Using Processors as Functional Units Yoav Etsion Alex Ramirez Rosa M. Badia Eduard Ayguade Jesus Labarta Mateo Valero HotPar, June 2010 Yoav Etsion Senior Researcher Parallel Programming

More information

IBM Cell Processor. Gilbert Hendry Mark Kretschmann

IBM Cell Processor. Gilbert Hendry Mark Kretschmann IBM Cell Processor Gilbert Hendry Mark Kretschmann Architectural components Architectural security Programming Models Compiler Applications Performance Power and Cost Conclusion Outline Cell Architecture:

More information

Performance potential for simulating spin models on GPU

Performance potential for simulating spin models on GPU Performance potential for simulating spin models on GPU Martin Weigel Institut für Physik, Johannes-Gutenberg-Universität Mainz, Germany 11th International NTZ-Workshop on New Developments in Computational

More information

CellSs: a Programming Model for the Cell BE Architecture

CellSs: a Programming Model for the Cell BE Architecture CellSs: a Programming Model for the Cell BE Architecture Pieter Bellens, Josep M. Perez, Rosa M. Badia and Jesus Labarta Barcelona Supercomputing Center and UPC Building Nexus II, Jordi Girona 29, Barcelona

More information

OmpSs Fundamentals. ISC 2017: OpenSuCo. Xavier Teruel

OmpSs Fundamentals. ISC 2017: OpenSuCo. Xavier Teruel OmpSs Fundamentals ISC 2017: OpenSuCo Xavier Teruel Outline OmpSs brief introduction OmpSs overview and influence in OpenMP Execution model and parallelization approaches Memory model and target copies

More information

OpenMPSuperscalar: Task-Parallel Simulation and Visualization of Crowds with Several CPUs and GPUs

OpenMPSuperscalar: Task-Parallel Simulation and Visualization of Crowds with Several CPUs and GPUs www.bsc.es OpenMPSuperscalar: Task-Parallel Simulation and Visualization of Crowds with Several CPUs and GPUs Hugo Pérez UPC-BSC Benjamin Hernandez Oak Ridge National Lab Isaac Rudomin BSC March 2015 OUTLINE

More information

COSC 6385 Computer Architecture. - Multi-Processors (V) The Intel Larrabee, Nvidia GT200 and Fermi processors

COSC 6385 Computer Architecture. - Multi-Processors (V) The Intel Larrabee, Nvidia GT200 and Fermi processors COSC 6385 Computer Architecture - Multi-Processors (V) The Intel Larrabee, Nvidia GT200 and Fermi processors Fall 2012 References Intel Larrabee: [1] L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M.

More information

Performance Analysis of BLAS Libraries in SuperLU_DIST for SuperLU_MCDT (Multi Core Distributed) Development

Performance Analysis of BLAS Libraries in SuperLU_DIST for SuperLU_MCDT (Multi Core Distributed) Development Available online at www.prace-ri.eu Partnership for Advanced Computing in Europe Performance Analysis of BLAS Libraries in SuperLU_DIST for SuperLU_MCDT (Multi Core Distributed) Development M. Serdar Celebi

More information

Distributed Dense Linear Algebra on Heterogeneous Architectures. George Bosilca

Distributed Dense Linear Algebra on Heterogeneous Architectures. George Bosilca Distributed Dense Linear Algebra on Heterogeneous Architectures George Bosilca bosilca@eecs.utk.edu Centraro, Italy June 2010 Factors that Necessitate to Redesign of Our Software» Steepness of the ascent

More information

Arquitecturas y Modelos de. Multicore

Arquitecturas y Modelos de. Multicore Arquitecturas y Modelos de rogramacion para Multicore 17 Septiembre 2008 Castellón Eduard Ayguadé Alex Ramírez Opening statements * Some visionaries already predicted multicores 30 years ago And they have

More information

high performance medical reconstruction using stream programming paradigms

high performance medical reconstruction using stream programming paradigms high performance medical reconstruction using stream programming paradigms This Paper describes the implementation and results of CT reconstruction using Filtered Back Projection on various stream programming

More information

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host

More information

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant

More information

OpenACC 2.6 Proposed Features

OpenACC 2.6 Proposed Features OpenACC 2.6 Proposed Features OpenACC.org June, 2017 1 Introduction This document summarizes features and changes being proposed for the next version of the OpenACC Application Programming Interface, tentatively

More information

The Stampede is Coming: A New Petascale Resource for the Open Science Community

The Stampede is Coming: A New Petascale Resource for the Open Science Community The Stampede is Coming: A New Petascale Resource for the Open Science Community Jay Boisseau Texas Advanced Computing Center boisseau@tacc.utexas.edu Stampede: Solicitation US National Science Foundation

More information

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013 Lecture 13: Memory Consistency + a Course-So-Far Review Parallel Computer Architecture and Programming Today: what you should know Understand the motivation for relaxed consistency models Understand the

More information

OpenCL TM & OpenMP Offload on Sitara TM AM57x Processors

OpenCL TM & OpenMP Offload on Sitara TM AM57x Processors OpenCL TM & OpenMP Offload on Sitara TM AM57x Processors 1 Agenda OpenCL Overview of Platform, Execution and Memory models Mapping these models to AM57x Overview of OpenMP Offload Model Compare and contrast

More information

Strategies for Parallelizing the Solution of Rational Matrix Equations

Strategies for Parallelizing the Solution of Rational Matrix Equations Strategies for Parallelizing the Solution of Rational Matrix Equations José M. Badía 1, Peter Benner, Maribel Castillo 1, Heike Faßbender 3, Rafael Mayo 1, Enrique S. Quintana-Ortí 1, and Gregorio Quintana-Ortí

More information

Higher Level Programming Abstractions for FPGAs using OpenCL

Higher Level Programming Abstractions for FPGAs using OpenCL Higher Level Programming Abstractions for FPGAs using OpenCL Desh Singh Supervising Principal Engineer Altera Corporation Toronto Technology Center ! Technology scaling favors programmability CPUs."#/0$*12'$-*

More information

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism Motivation for Parallelism Motivation for Parallelism The speed of an application is determined by more than just processor speed. speed Disk speed Network speed... Multiprocessors typically improve the

More information

Parallel Computing Using OpenMP/MPI. Presented by - Jyotsna 29/01/2008

Parallel Computing Using OpenMP/MPI. Presented by - Jyotsna 29/01/2008 Parallel Computing Using OpenMP/MPI Presented by - Jyotsna 29/01/2008 Serial Computing Serially solving a problem Parallel Computing Parallelly solving a problem Parallel Computer Memory Architecture Shared

More information

6.1 Multiprocessor Computing Environment

6.1 Multiprocessor Computing Environment 6 Parallel Computing 6.1 Multiprocessor Computing Environment The high-performance computing environment used in this book for optimization of very large building structures is the Origin 2000 multiprocessor,

More information

Computing architectures Part 2 TMA4280 Introduction to Supercomputing

Computing architectures Part 2 TMA4280 Introduction to Supercomputing Computing architectures Part 2 TMA4280 Introduction to Supercomputing NTNU, IMF January 16. 2017 1 Supercomputing What is the motivation for Supercomputing? Solve complex problems fast and accurately:

More information

High performance matrix inversion of SPD matrices on graphics processors

High performance matrix inversion of SPD matrices on graphics processors High performance matrix inversion of SPD matrices on graphics processors Peter Benner, Pablo Ezzatti, Enrique S. Quintana-Ortí and Alfredo Remón Max-Planck-Institute for Dynamics of Complex Technical Systems

More information

PGI Accelerator Programming Model for Fortran & C

PGI Accelerator Programming Model for Fortran & C PGI Accelerator Programming Model for Fortran & C The Portland Group Published: v1.3 November 2010 Contents 1. Introduction... 5 1.1 Scope... 5 1.2 Glossary... 5 1.3 Execution Model... 7 1.4 Memory Model...

More information

Chapter 3 Parallel Software

Chapter 3 Parallel Software Chapter 3 Parallel Software Part I. Preliminaries Chapter 1. What Is Parallel Computing? Chapter 2. Parallel Hardware Chapter 3. Parallel Software Chapter 4. Parallel Applications Chapter 5. Supercomputers

More information

Solving Dense Linear Systems on Graphics Processors

Solving Dense Linear Systems on Graphics Processors Informe Técnico ICC 2-2-28 Solving Dense Linear Systems on Graphics Processors Sergio Barrachina, Maribel Castillo, Francisco D. Igual, Rafael Mayo, Enrique S. Quintana-Ortí Febrero de 28 Departamento

More information

Introduction to CELL B.E. and GPU Programming. Agenda

Introduction to CELL B.E. and GPU Programming. Agenda Introduction to CELL B.E. and GPU Programming Department of Electrical & Computer Engineering Rutgers University Agenda Background CELL B.E. Architecture Overview CELL B.E. Programming Environment GPU

More information

Productive Performance on the Cray XK System Using OpenACC Compilers and Tools

Productive Performance on the Cray XK System Using OpenACC Compilers and Tools Productive Performance on the Cray XK System Using OpenACC Compilers and Tools Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc. 1 The New Generation of Supercomputers Hybrid

More information

Chapter 17 - Parallel Processing

Chapter 17 - Parallel Processing Chapter 17 - Parallel Processing Luis Tarrataca luis.tarrataca@gmail.com CEFET-RJ Luis Tarrataca Chapter 17 - Parallel Processing 1 / 71 Table of Contents I 1 Motivation 2 Parallel Processing Categories

More information

Architecture, Programming and Performance of MIC Phi Coprocessor

Architecture, Programming and Performance of MIC Phi Coprocessor Architecture, Programming and Performance of MIC Phi Coprocessor JanuszKowalik, Piotr Arłukowicz Professor (ret), The Boeing Company, Washington, USA Assistant professor, Faculty of Mathematics, Physics

More information

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of

More information

Massively Parallel Architectures

Massively Parallel Architectures Massively Parallel Architectures A Take on Cell Processor and GPU programming Joel Falcou - LRI joel.falcou@lri.fr Bat. 490 - Bureau 104 20 janvier 2009 Motivation The CELL processor Harder,Better,Faster,Stronger

More information

OpenMP for next generation heterogeneous clusters

OpenMP for next generation heterogeneous clusters OpenMP for next generation heterogeneous clusters Jens Breitbart Research Group Programming Languages / Methodologies, Universität Kassel, jbreitbart@uni-kassel.de Abstract The last years have seen great

More information

WHY PARALLEL PROCESSING? (CE-401)

WHY PARALLEL PROCESSING? (CE-401) PARALLEL PROCESSING (CE-401) COURSE INFORMATION 2 + 1 credits (60 marks theory, 40 marks lab) Labs introduced for second time in PP history of SSUET Theory marks breakup: Midterm Exam: 15 marks Assignment:

More information

Accelerated Library Framework for Hybrid-x86

Accelerated Library Framework for Hybrid-x86 Software Development Kit for Multicore Acceleration Version 3.0 Accelerated Library Framework for Hybrid-x86 Programmer s Guide and API Reference Version 1.0 DRAFT SC33-8406-00 Software Development Kit

More information

Tesla GPU Computing A Revolution in High Performance Computing

Tesla GPU Computing A Revolution in High Performance Computing Tesla GPU Computing A Revolution in High Performance Computing Mark Harris, NVIDIA Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction to Tesla CUDA Architecture Programming & Memory

More information

THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT HARDWARE PLATFORMS

THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT HARDWARE PLATFORMS Computer Science 14 (4) 2013 http://dx.doi.org/10.7494/csci.2013.14.4.679 Dominik Żurek Marcin Pietroń Maciej Wielgosz Kazimierz Wiatr THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT

More information

Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions

Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions Ziming Zhong Vladimir Rychkov Alexey Lastovetsky Heterogeneous Computing

More information

A Translation Framework for Automatic Translation of Annotated LLVM IR into OpenCL Kernel Function

A Translation Framework for Automatic Translation of Annotated LLVM IR into OpenCL Kernel Function A Translation Framework for Automatic Translation of Annotated LLVM IR into OpenCL Kernel Function Chen-Ting Chang, Yu-Sheng Chen, I-Wei Wu, and Jyh-Jiun Shann Dept. of Computer Science, National Chiao

More information

Evaluation of Asynchronous Offloading Capabilities of Accelerator Programming Models for Multiple Devices

Evaluation of Asynchronous Offloading Capabilities of Accelerator Programming Models for Multiple Devices Evaluation of Asynchronous Offloading Capabilities of Accelerator Programming Models for Multiple Devices Jonas Hahnfeld 1, Christian Terboven 1, James Price 2, Hans Joachim Pflug 1, Matthias S. Müller

More information

On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters

On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters 1 On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters N. P. Karunadasa & D. N. Ranasinghe University of Colombo School of Computing, Sri Lanka nishantha@opensource.lk, dnr@ucsc.cmb.ac.lk

More information

Big Data Analytics Performance for Large Out-Of- Core Matrix Solvers on Advanced Hybrid Architectures

Big Data Analytics Performance for Large Out-Of- Core Matrix Solvers on Advanced Hybrid Architectures Procedia Computer Science Volume 51, 2015, Pages 2774 2778 ICCS 2015 International Conference On Computational Science Big Data Analytics Performance for Large Out-Of- Core Matrix Solvers on Advanced Hybrid

More information

Evaluation of sparse LU factorization and triangular solution on multicore architectures. X. Sherry Li

Evaluation of sparse LU factorization and triangular solution on multicore architectures. X. Sherry Li Evaluation of sparse LU factorization and triangular solution on multicore architectures X. Sherry Li Lawrence Berkeley National Laboratory ParLab, April 29, 28 Acknowledgement: John Shalf, LBNL Rich Vuduc,

More information

Parallel Computing. Hwansoo Han (SKKU)

Parallel Computing. Hwansoo Han (SKKU) Parallel Computing Hwansoo Han (SKKU) Unicore Limitations Performance scaling stopped due to Power consumption Wire delay DRAM latency Limitation in ILP 10000 SPEC CINT2000 2 cores/chip Xeon 3.0GHz Core2duo

More information

Parallel and Distributed Computing

Parallel and Distributed Computing Parallel and Distributed Computing NUMA; OpenCL; MapReduce José Monteiro MSc in Information Systems and Computer Engineering DEA in Computational Engineering Department of Computer Science and Engineering

More information

A Case for Hardware Task Management Support for the StarSS Programming Model

A Case for Hardware Task Management Support for the StarSS Programming Model A Case for Hardware Task Management Support for the StarSS Programming Model Cor Meenderinck Delft University of Technology Delft, the Netherlands cor@ce.et.tudelft.nl Ben Juurlink Technische Universität

More information

Dealing with Heterogeneous Multicores

Dealing with Heterogeneous Multicores Dealing with Heterogeneous Multicores François Bodin INRIA-UIUC, June 12 th, 2009 Introduction Main stream applications will rely on new multicore / manycore architectures It is about performance not parallelism

More information

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors Computer and Information Sciences College / Computer Science Department CS 207 D Computer Architecture Lecture 9: Multiprocessors Challenges of Parallel Processing First challenge is % of program inherently

More information

Device Memories and Matrix Multiplication

Device Memories and Matrix Multiplication Device Memories and Matrix Multiplication 1 Device Memories global, constant, and shared memories CUDA variable type qualifiers 2 Matrix Multiplication an application of tiling runningmatrixmul in the

More information

Compilation for Heterogeneous Platforms

Compilation for Heterogeneous Platforms Compilation for Heterogeneous Platforms Grid in a Box and on a Chip Ken Kennedy Rice University http://www.cs.rice.edu/~ken/presentations/heterogeneous.pdf Senior Researchers Ken Kennedy John Mellor-Crummey

More information

computational power computational

computational power computational rcuda: rcuda: an an approach approach to to provide provide remote remote access access to to computational computational power power Rafael Mayo Gual Universitat Jaume I Spain (1 of 59) HPC Advisory Council

More information

A Uniform Programming Model for Petascale Computing

A Uniform Programming Model for Petascale Computing A Uniform Programming Model for Petascale Computing Barbara Chapman University of Houston WPSE 2009, Tsukuba March 25, 2009 High Performance Computing and Tools Group http://www.cs.uh.edu/~hpctools Agenda

More information

Overview: The OpenMP Programming Model

Overview: The OpenMP Programming Model Overview: The OpenMP Programming Model motivation and overview the parallel directive: clauses, equivalent pthread code, examples the for directive and scheduling of loop iterations Pi example in OpenMP

More information

Parallel and High Performance Computing CSE 745

Parallel and High Performance Computing CSE 745 Parallel and High Performance Computing CSE 745 1 Outline Introduction to HPC computing Overview Parallel Computer Memory Architectures Parallel Programming Models Designing Parallel Programs Parallel

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 6. Parallel Processors from Client to Cloud

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 6. Parallel Processors from Client to Cloud COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 6 Parallel Processors from Client to Cloud Introduction Goal: connecting multiple computers to get higher performance

More information

Lecture 13: March 25

Lecture 13: March 25 CISC 879 Software Support for Multicore Architectures Spring 2007 Lecture 13: March 25 Lecturer: John Cavazos Scribe: Ying Yu 13.1. Bryan Youse-Optimization of Sparse Matrix-Vector Multiplication on Emerging

More information

Integrating DMA capabilities into BLIS for on-chip data movement. Devangi Parikh Ilya Polkovnichenko Francisco Igual Peña Murtaza Ali

Integrating DMA capabilities into BLIS for on-chip data movement. Devangi Parikh Ilya Polkovnichenko Francisco Igual Peña Murtaza Ali Integrating DMA capabilities into BLIS for on-chip data movement Devangi Parikh Ilya Polkovnichenko Francisco Igual Peña Murtaza Ali 5 Generations of TI Multicore Processors Keystone architecture Lowers

More information

OmpSs + OpenACC Multi-target Task-Based Programming Model Exploiting OpenACC GPU Kernel

OmpSs + OpenACC Multi-target Task-Based Programming Model Exploiting OpenACC GPU Kernel www.bsc.es OmpSs + OpenACC Multi-target Task-Based Programming Model Exploiting OpenACC GPU Kernel Guray Ozen guray.ozen@bsc.es Exascale in BSC Marenostrum 4 (13.7 Petaflops ) General purpose cluster (3400

More information

OpenACC. Introduction and Evolutions Sebastien Deldon, GPU Compiler engineer

OpenACC. Introduction and Evolutions Sebastien Deldon, GPU Compiler engineer OpenACC Introduction and Evolutions Sebastien Deldon, GPU Compiler engineer 3 WAYS TO ACCELERATE APPLICATIONS Applications Libraries Compiler Directives Programming Languages Easy to use Most Performance

More information

Accelerating GPU kernels for dense linear algebra

Accelerating GPU kernels for dense linear algebra Accelerating GPU kernels for dense linear algebra Rajib Nath, Stanimire Tomov, and Jack Dongarra Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville {rnath1, tomov,

More information

INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing

INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing UNIVERSIDADE TÉCNICA DE LISBOA INSTITUTO SUPERIOR TÉCNICO Departamento de Engenharia Informática Architectures for Embedded Computing MEIC-A, MEIC-T, MERC Lecture Slides Version 3.0 - English Lecture 12

More information

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller Entertainment Graphics: Virtual Realism for the Masses CSE 591: GPU Programming Introduction Computer games need to have: realistic appearance of characters and objects believable and creative shading,

More information