Extending OpenMP to Survive the Heterogeneous Multi-Core Era


Eduard Ayguadé · Rosa M. Badia · Pieter Bellens · Daniel Cabrera · Alejandro Duran · Roger Ferrer · Marc Gonzàlez · Francisco Igual · Daniel Jiménez-González · Jesús Labarta · Luis Martinell · Xavier Martorell · Rafael Mayo · Josep M. Pérez · Judit Planas · Enrique S. Quintana-Ortí

Received: 27 April 2010 / Accepted: 27 April 2010
© Springer Science+Business Media, LLC 2010

E. Ayguadé (corresponding author, eduard.ayguade@bsc.es), R. M. Badia, P. Bellens, D. Cabrera, A. Duran, R. Ferrer, M. Gonzàlez, D. Jiménez-González, J. Labarta, L. Martinell, X. Martorell, J. M. Pérez, J. Planas: Barcelona Supercomputing Center (Centro Nacional de Supercomputación, BSC-CNS), Barcelona, Spain. E. Ayguadé, M. Gonzàlez, D. Jiménez-González, J. Labarta, X. Martorell: Depto. de Arquitectura de Computadores, Universitat Politècnica de Catalunya, Barcelona, Spain. R. M. Badia: IIIA, Artificial Intelligence Research Institute, CSIC, Spanish National Research Council, Madrid, Spain. F. Igual, R. Mayo, E. S. Quintana-Ortí: Depto. de Ingeniería y Ciencia de Computadores, Universidad Jaume I (UJI), Castellón, Spain.

Abstract This paper advances the state of the art in programming models for exploiting task-level parallelism on heterogeneous many-core systems, presenting a number of extensions to the OpenMP language inspired by the StarSs programming model. The proposed extensions allow the programmer to easily write portable code for a number of different platforms, relieving him/her from developing the specific code to off-load tasks to the accelerators and to synchronize those tasks. Our results, obtained from the StarSs instantiations for SMPs, the Cell B.E., and GPUs, report reasonable parallel performance. However, the real impact of our approach is the productivity gains it yields for the programmer.

Keywords Parallel computing · Programming models · Runtime systems · Task-level parallelism · Multi-core processors · Hardware accelerators · Heterogeneous computing

1 Introduction

In response to the combined hurdles of power dissipation, large memory latency, and the little instruction-level parallelism left to be exploited, all major hardware manufacturers have adopted the replication of cores on-chip as the mainstream path to deliver higher performance [1]. Today, chips with a few general-purpose, fully-functional cores are available from Intel (2-6 cores), AMD (2-4 cores), or Sun (8 cores), to name a few, and the number of cores is expected to increase with each shrink of the process technology. Chips in the near future will potentially integrate hundreds or thousands of cores. Graphics processors (GPUs) from NVIDIA and AMD/ATI, on the other hand, are already in the many-core era, featuring hundreds of fine-grained stream cores per processor (up to 240 and 320 in their latest designs, respectively). Together with GPUs, hardware accelerators like the heterogeneous IBM/Sony/Toshiba Cell B.E., ClearSpeed ASICs or the FPGAs from multiple vendors are appealing in that, compared with general-purpose multi-core processors, they offer much higher performance-cost and performance-power ratios for certain applications.

Therefore, we envision a future of heterogeneous architectures, equipped with a few coarse-grain general-purpose cores and several accelerators, possibly of a different nature. Applications will be executed on the most appropriate technology, while other parts of the processor may be turned off to decrease the power used by the chip. While the increasing number of transistors on-chip dictated by Moore's Law indicates that many-core processors, with a blend of heterogeneous technologies, will be feasible in the near future, the availability of applications for these architectures, and more specifically of programming models, will really determine the success or failure of these designs. How easy it is to develop programs that efficiently exploit parallelism at all levels (instruction, data, task, and node) in these parallel processors is the key to the future. The heterogeneous nature of these systems, and the existence of multiple separate address spaces, only exacerbates the height of the programmability wall.

The majority of proposals in this current Tower of Babel era assume a host-directed programming and execution model with attached accelerator devices. The bulk of the user's application executes on the host while user-specified code regions are offloaded to the accelerator. In general, the specifics of the different accelerator architectures make programming extremely difficult if one plans to use the vendor-provided SDKs (e.g., libspe for the Cell B.E. or CUDA for NVIDIA GPUs). It would be desirable to retain most of the advantages of using these SDKs, but in a productive and portable manner, avoiding the mix of hardware-specific code (for task offloading, data movement, etc.) with application code. The recent attempt of OpenCL to unify the programming models for architectures based on hardware accelerators tries to ensure portability, low-level access to the hardware, and supposedly high performance. We believe, however, that OpenCL still exposes much of the low-level details, making it cumbersome to use for non-experts.

OpenMP [2] survived the explosion of parallel programming languages of the 90s to become the standard for parallelizing regular applications on shared-memory multiprocessors. While recent additions to OpenMP 3.0 accommodate task parallelism, making it more suitable for irregular codes, OpenMP is not ready for the challenges posed by the new generation of multi-core heterogeneous architectures. Star Superscalar (StarSs) [3] is a promising programming model in this direction that we have used as the starting point to propose a set of extensions to OpenMP. StarSs has its roots in the runtime exploitation of task parallelism, with special emphasis on portability, simplicity and flexibility. Functions that are suitable for parallel execution are annotated as tasks, and their arguments are tagged with their directionality (input, output or both). This information is provided using a simple OpenMP-like annotation of the source code (e.g., pragmas in C/C++) and is used at runtime to build a task dependence graph dynamically. This graph is one of the main elements used by the runtime to schedule tasks as soon as all their dependences are honored and the appropriate resource to execute them is available. For those architectures with local memories, the runtime also takes care of moving the associated data in and out.
The scheduler may also be driven by data locality policies; other target-dependent optimizations of the scheduler can also be incorporated into the general framework of StarSs.

Current instantiations of the StarSs programming model and tools include GRIDSs (for the Grid), CellSs (for the Cell B.E.), SMPSs (for general-purpose multi-core processors) and GPUSs (for platforms with multiple GPUs). We are currently extending it to cover platforms with multiple generic hardware accelerators and FPGAs.

2 The StarSs Extensions to OpenMP

OpenMP has traditionally been employed to exploit loop-based parallelism, present in most regular scientific and engineering applications, on shared-memory multiprocessors. Recently, OpenMP 3.0 has been extended with tasks, or deferrable units of work, to also accommodate irregular applications. In particular, in OpenMP 3.0 the programmer can specify tasks, and later ensure that all the tasks defined up to some point have finished. Tasks are generated in the context of a team of threads, while the parallel construct creates such a team. A task is created when the code reaches the task construct, defined as follows:

#pragma omp task [clause-list]
structured-block

Valid clauses for this construct are untied, if, shared, private and firstprivate. The untied clause specifies that the task can be resumed by a different thread after a possible task switching point; when the expression in the if clause evaluates to false, the encountering thread suspends the current task region and begins execution of the new task immediately. The last three clauses are used for setting the data-sharing attributes of variables in the task body, and have the following syntax:

shared( variable-list )
private( variable-list )
firstprivate( variable-list )

where variable-list is a list of identifiers. Naming a variable inside a data-sharing clause explicitly sets the data-sharing attribute for the variable in the task construct. References in the task construct to variables for which the data-sharing attribute is private or firstprivate do not refer to the original variable but to a private storage of the task. Variables annotated with firstprivate, in addition, will have such storage initialized with the value of the original variable when the program execution reaches the task construct. References to variables for which the data-sharing attribute is shared refer to the original variable.

StarSs extends the task mechanism in OpenMP 3.0 to allow the specification of dependencies between tasks and to map the execution of certain tasks to a type of hardware accelerator (a device). StarSs considers each accelerator (e.g., an SPE, a GPU, an FPGA) as a single execution unit, which can efficiently execute specialized pieces of code. The runtime implementation of the model isolates the user from all the complexities related to task scheduling and offloading. The StarSs extensions are orthogonal to other possible extensions to generate efficient code by a compiler (e.g., vectorization width, number of threads running on accelerators, and code transformations). In the next sections, we explore how these extensions could be mapped to the OpenMP language.
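For reference, before introducing the extensions, the following minimal sketch (not taken from the paper; the function names are illustrative) shows how the OpenMP 3.0 constructs summarized above are typically combined: tasks are created inside a single region of a parallel team, data-sharing clauses control what each task sees, and a taskwait ensures completion.

void process(int *data, int i);        /* hypothetical per-chunk kernel */

void run(int *data, int nchunks)
{
    #pragma omp parallel
    #pragma omp single
    {
        for (int i = 0; i < nchunks; i++) {
            /* data is shared among the tasks; i gets a private copy
               initialized when the task is created (firstprivate) */
            #pragma omp task shared(data) firstprivate(i)
            process(data, i);
        }
        /* wait for all tasks created above to finish */
        #pragma omp taskwait
    }
}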

2.1 Taskifying Functions and Expressing Dependencies

StarSs can specify that a function should be executed as a task. To allow this, we have extended the OpenMP task construct to annotate functions in addition to structured blocks:

#pragma omp task [clause-list]
{ function-declaration | function-definition | structured-block }

Whenever the program calls a function annotated in this way, the runtime will create an explicit task. Although this seems to be a simple and naive extension, it associates an implicit name with the task that will be used later (see Sect. 2.3). We have also extended this construct with the StarSs clauses input, output and inout. This information is used to derive dependencies among tasks at runtime. The syntax of these clauses is:

input( data-reference-list )
output( data-reference-list )
inout( data-reference-list )

Dependencies are expressed by means of data-reference-lists, which are a superset of a variable-list. A data-reference in such a list can contain a variable identifier, but also references to subobjects. References to subobjects include array element references (e.g., a[4]), array sections (a[3:6]), field references (a.b), and elaborated shaping expressions ([10][20] p). For simplicity, details on the syntax used to define subobjects will be introduced in the following examples as well as in Sect. 2.4. Implicit tasks created in parallel regions are assumed to be totally independent; it is not allowed to use input, output and inout in a parallel construct.

Figure 1 illustrates the use of the extended task construct to parallelize a sequential code that computes the matrix multiplication C = C + A*B. In this particular code, the programmer defines each element of matrices A, B and C as a pointer to a block of BS x BS floats, which are allocated from inside the main function. Each task corresponds to an instantiation of function gemm, and the programmer uses the inout clause to express the data dependence that exists among tasks computing the same block of C (several tasks will overwrite the same block). In this simple example, since the allocation and initialization of the matrices is done sequentially, there is no need to annotate the blocks of matrices A and B with the input clause, as they are not involved in any data dependence.

Fig. 1 Matrix multiplication example annotated with our proposed extensions to OpenMP
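Since Fig. 1 itself is not reproduced in this copy, the following sketch (a reconstruction under assumptions: block size BS, row-major blocks of BS x BS floats, and the [BS][BS] shaping syntax described in Sect. 2.4) illustrates the annotation style described above. It requires a compiler implementing the proposed extensions; a standard OpenMP 3.0 compiler will not accept the input/output/inout clauses.

#define BS 64                          /* assumed block size */

/* Every call to gemm becomes a task. Annotating C as inout is what chains
   the tasks that update the same block of C; A and B carry no dependences
   here because they are initialized sequentially before any task runs. */
#pragma omp task inout([BS][BS]C)
void gemm(float *A, float *B, float *C)
{
    for (int i = 0; i < BS; i++)
        for (int j = 0; j < BS; j++)
            for (int k = 0; k < BS; k++)
                C[i*BS + j] += A[i*BS + k] * B[k*BS + j];
}

/* C = C + A*B on NB x NB matrices of pointers to BS x BS blocks */
void matmul(int NB, float *A[NB][NB], float *B[NB][NB], float *C[NB][NB])
{
    for (int i = 0; i < NB; i++)
        for (int j = 0; j < NB; j++)
            for (int k = 0; k < NB; k++)
                gemm(A[i][k], B[k][j], C[i][j]);   /* task creation point */
    #pragma omp taskwait
}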

In order to show an example with a richer set of dependencies, we use the LU factorization of a blocked sparse matrix A consisting of NB x NB blocks, where some of the blocks are void (i.e., all their entries are zero) while some others are fully populated with nonzero entries (they are dense). This is captured in the data structure that holds the matrix, where we assume that storage is only allocated for the dense blocks. Figure 2 shows the sequential Sparse_LU function annotated with the extended task construct, and Fig. 3 illustrates the dependencies that are encountered during the execution of this code on a particular blocked sparse matrix. Using the task construct, the programmer identifies four types of tasks, which correspond to the invocation of the kernels lu_getrf, lu_trsm_right, lu_trsm_left, and lu_gemm. In this case, the kernel call is annotated and the programmer needs to indicate the coordinates of the blocks involved in the operation (e.g., A[k][k] for lu_getrf) and their dimensions ([0:BS-1][0:BS-1] in all cases). For kernel lu_gemm, for example, the programmer also specifies that the first and second arguments (A and B) are input parameters (they are only read during the execution of the kernel) while the third argument (C) is inout (it is read and written during the execution of the kernel). Note that all references in the code correspond to blocks of the same matrix, yielding an elaborate dependence graph for this example (see Fig. 3). Note also that the dependencies depend on the input data (the sparsity pattern of the input matrix) and that new blocks can be dynamically generated (line 33 in Fig. 2). The annotations are placed on the original sequential version, with no transformations applied to expose the inherent parallelism available. The runtime employs the information implicit in the graph, transparently to the user/programmer, to extract task parallelism while satisfying the dependencies among tasks.
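Fig. 2 is also not reproduced here. The sketch below conveys the flavour of the annotated Sparse_LU code using the kernel names given in the text; it is not the paper's figure. In particular, the paper annotates the kernel calls with explicit block coordinates and dimensions, whereas this sketch places equivalent clauses on the kernel declarations, and which triangular solve updates row blocks versus column blocks is an assumption.

#include <stdlib.h>

#define BS 64                          /* assumed block size */

#pragma omp task inout([BS][BS]A)
void lu_getrf(float *A);                  /* factorize a diagonal block */

#pragma omp task input([BS][BS]D) inout([BS][BS]A)
void lu_trsm_right(float *D, float *A);   /* update a row block (assumed) */

#pragma omp task input([BS][BS]D) inout([BS][BS]A)
void lu_trsm_left(float *D, float *A);    /* update a column block (assumed) */

#pragma omp task input([BS][BS]A, [BS][BS]B) inout([BS][BS]C)
void lu_gemm(float *A, float *B, float *C);

void sparse_lu(int NB, float *A[NB][NB])
{
    for (int k = 0; k < NB; k++) {
        lu_getrf(A[k][k]);
        for (int j = k + 1; j < NB; j++)
            if (A[k][j]) lu_trsm_right(A[k][k], A[k][j]);
        for (int i = k + 1; i < NB; i++) {
            if (!A[i][k]) continue;
            lu_trsm_left(A[k][k], A[i][k]);
            for (int j = k + 1; j < NB; j++)
                if (A[k][j]) {
                    if (!A[i][j])              /* fill-in: allocate a new block */
                        A[i][j] = calloc(BS * BS, sizeof(float));
                    lu_gemm(A[i][k], A[k][j], A[i][j]);
                }
        }
    }
    #pragma omp taskwait
}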

Fig. 2 Sparse_LU example annotated with our proposed extensions to OpenMP

Fig. 3 Footprint of a 5 x 5 blocked sparse matrix (left) and the dependency graph for its sparse LU factorization (right). In the graph, square, triangle, diamond and circle shapes correspond to tasks lu_getrf, lu_trsm_right, lu_trsm_left and lu_gemm, respectively

2.2 Specifying Target Devices: Heterogeneous Cholesky

To target heterogeneous systems composed of general-purpose processors and hardware accelerators, we add a StarSs construct to OpenMP that may precede an existing task pragma:

#pragma omp target device(device-name-list) [clause-list]

The target construct specifies that the execution of the task could be off-loaded to any of the (types of) devices specified in device-name-list (and, as such, its code must be handled by the proper compiler backend). If the task is not preceded by a target directive, then the default device-name is used, which is smp and corresponds to a homogeneous shared-memory multicore architecture. Other device-names are vendor specific. We will use three examples in this paper to specify the accelerator: spe for a single SPE of the Cell B.E., cuda for a whole GPU and fpga for a whole FPGA. Two additional clauses can be used with the target pragma:

copy_in( data-reference-list )
copy_out( data-reference-list )

These two clauses, which are ignored for the smp device, specify data movement for shared variables used inside the task. The copy_in clause specifies those variables that must be moved from host memory to device memory. The copy_out clause specifies those variables that need to be moved back from device memory to host memory.

Figure 4 shows code that computes the Cholesky factorization of a dense matrix A consisting of NB x NB blocks of dimension BS x BS each. The operation is decomposed into four types of tasks: Cholesky factorization of a diagonal block (chol_potrf), triangular solve involving the subdiagonal blocks (chol_trsm), symmetric update of the blocks on the diagonal (chol_syrk), and update of the remaining blocks (chol_gemm). The target construct is used here to specify that all these tasks, except for the factorization of the diagonal block, should be computed on a cuda accelerator (i.e., a GPU). Other vendor-specific clauses in the target construct for each particular device-name are possible.

Fig. 4 Cholesky example annotated with our proposed extensions to OpenMP
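As with the other figures, Fig. 4 is not reproduced in this copy; the sketch below shows how the target construct and its copy_in/copy_out clauses could be combined with the task annotations for the Cholesky kernels named above. The kernel bodies, the block layout and the exact clause usage are assumptions, and the cuda versions of the kernels are left to the accelerator-specific compiler.

#define BS 64                                 /* assumed block size */

/* The diagonal factorization stays on the default smp device. */
#pragma omp task inout([BS][BS]A)
void chol_potrf(float *A);

/* The remaining kernels may be off-loaded to a GPU; copy_in/copy_out tell
   the runtime which blocks to move between host and device memory. */
#pragma omp target device(cuda) copy_in([BS][BS]T, [BS][BS]B) copy_out([BS][BS]B)
#pragma omp task input([BS][BS]T) inout([BS][BS]B)
void chol_trsm(float *T, float *B);

#pragma omp target device(cuda) copy_in([BS][BS]A, [BS][BS]C) copy_out([BS][BS]C)
#pragma omp task input([BS][BS]A) inout([BS][BS]C)
void chol_syrk(float *A, float *C);

#pragma omp target device(cuda) copy_in([BS][BS]A, [BS][BS]B, [BS][BS]C) copy_out([BS][BS]C)
#pragma omp task input([BS][BS]A, [BS][BS]B) inout([BS][BS]C)
void chol_gemm(float *A, float *B, float *C);

/* Right-looking blocked Cholesky on the lower triangle of A. */
void cholesky(int NB, float *A[NB][NB])
{
    for (int k = 0; k < NB; k++) {
        chol_potrf(A[k][k]);
        for (int i = k + 1; i < NB; i++)
            chol_trsm(A[k][k], A[i][k]);
        for (int i = k + 1; i < NB; i++) {
            for (int j = k + 1; j < i; j++)
                chol_gemm(A[i][k], A[j][k], A[i][j]);
            chol_syrk(A[i][k], A[i][i]);
        }
    }
    #pragma omp taskwait
}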

Some restrictions may apply to tasks that target a specific device (for example, they may not contain any other OpenMP directives or perform any input/output on some devices). In addition, tasks offloaded to some specific devices should be tied, or they should execute on the same type of device if thread switching is allowed.

2.3 Specifying Alternative Implementations: Matrix Multiplication Revisited

The target construct also offers the implements clause to specify alternative implementations, tailored to specific accelerator devices, for a taskified function. The syntax of the clause is:

implements( function-name )

Using this clause, in the example in Fig. 5, the programmer specifies three possible options to execute function gemm. The first one uses the original definition of function gemm for the default target architecture. The user also specifies two alternative implementations: gemm_cuda for an NVIDIA GPU, and gemm_spe for the Cell B.E. For all the devices, the runtime is in charge of moving data before and after the execution of the task. If the original implementation is appropriate for one of the accelerator types, then the programmer should precede the definition of the task with the specification of the target device (as in line 1 of Fig. 5). In this case, the compiler would generate two versions of the same function, one going through the native optimizer for the default device, and another using the accelerator-specific compiler.

Fig. 5 Matrix multiplication example annotated with our proposed extensions to OpenMP
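The sketch below illustrates the implements clause in the spirit of Fig. 5 (again a reconstruction, not the figure itself). The generic gemm task is compiled for the default smp device; gemm_cuda and gemm_spe are hand-written alternatives that the runtime may substitute when it schedules the task on a GPU or an SPE, with data movement handled automatically.

#define BS 64                          /* assumed block size */

/* Original task; this version runs on the default smp device. */
#pragma omp task inout([BS][BS]C)
void gemm(float *A, float *B, float *C)
{
    for (int i = 0; i < BS; i++)
        for (int j = 0; j < BS; j++)
            for (int k = 0; k < BS; k++)
                C[i*BS + j] += A[i*BS + k] * B[k*BS + j];
}

/* Device-specific alternatives registered through implements(gemm);
   their bodies (a CUDA kernel launch, an SPE kernel) are omitted. */
#pragma omp target device(cuda) implements(gemm)
void gemm_cuda(float *A, float *B, float *C);

#pragma omp target device(spe) implements(gemm)
void gemm_spe(float *A, float *B, float *C);

Call sites remain plain calls to gemm(); the choice of implementation is deferred to the runtime scheduler.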

2.4 Specifying Array Section Dependences

Through the examples so far we have seen dependence specifications of scalar objects or full arrays, but our syntax can also specify sections of arrays. Since C/C++ does not have any way to express ranges of an array, we have borrowed the array-section syntax from Fortran 90. An array section is expressed as a[first:last], meaning all elements of the array a from the first to the last element, inclusive. Both first and last are expressions evaluated at execution time.

Figure 6 shows a simple example of array sections where task A fills the bottom half of the array a (i.e., elements 0 to N/2-1) and task B fills the remaining elements (i.e., elements N/2 to N-1). Task C waits until both tasks are finished before executing. For syntactic economy, a[:last] is the same as a[0:last]. For arrays where the upper bound is known, a[first:] and a[:] mean, respectively, a[first:n-1] and a[0:n-1]. Designating an array (i.e., a) in a data-reference list without an array section or an array subscript is equivalent to the whole array section (i.e., a[:]). Array sections can also be specified for multidimensional arrays by specifying one section for each dimension (e.g., a[1:2][3:4]).

While technically not arrays, array sections can also be applied to pointers: p[first:last] refers to the elements (p + first) to (p + last). Pointers to arrays can use multidimensional sections, but because pointers lack dimensional information, multidimensional sections are not allowed for pointer-to-pointer types without a shaping expression. In order to use multidimensional sections over pointers, the structural information of the array dimensions needs to be restored. A shaping expression serves that purpose. Shaping expressions are a sequence of dimensions, enclosed in square brackets, followed by a data reference that should refer to a pointer type: [N]p. For example, in Fig. 7, the input clause on line 3 creates a dependence against the pointer value and not the pointed data, but using a shaping expression as in line 5 the dependence is against the pointed data. As shown in line 7, shaping expressions enable multidimensional sections over pointers. Array parameters are implicitly converted to pointer types in C/C++, where the outermost dimension is lost; thus, a shaping expression is required to define a dependence on the whole array. This is why in line 10 of Fig. 7 the input clause creates a dependence with the pointer, but in line 11 the dependence is created against the matrix stored through the pointer.

Fig. 6 Simple example of array sections

Fig. 7 Shaping expression examples
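A compact sketch combining the ideas of Figs. 6 and 7 (the helper functions are hypothetical, and the figures' exact line numbering is not reproduced): array sections let the runtime see that two tasks write disjoint halves of an array, while shaping expressions distinguish a dependence on a pointer value from a dependence on the data it points to.

#define N 1024

float a[N];                        /* array with known bounds */

void init_lower(float *v);         /* hypothetical helpers */
void init_upper(float *v);
void consume(const float *v);
void use_pointer(float *p);
void use_data(float *p);
void use_matrix(float *m);

void sections_example(void)
{
    #pragma omp task output(a[0:N/2-1])    /* task A: bottom half */
    init_lower(a);

    #pragma omp task output(a[N/2:N-1])    /* task B: top half    */
    init_upper(a);

    #pragma omp task input(a[:])           /* task C: whole array, waits for A and B */
    consume(a);

    #pragma omp taskwait
}

void shaping_example(float *p, float *m)   /* m points to N*N floats */
{
    #pragma omp task input(p)              /* dependence on the pointer value */
    use_pointer(p);

    #pragma omp task input([N]p)           /* dependence on the pointed data  */
    use_data(p);

    #pragma omp task input([N][N]m)        /* 2-D view of a flat pointer; sections
                                              over the shaped pointer are also possible */
    use_matrix(m);

    #pragma omp taskwait
}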

2.5 Additional Clause for Taskwait

We have also extended the OpenMP taskwait construct with an on clause from StarSs, as follows:

#pragma omp taskwait on( data-reference-list )

in order to wait for the termination of those tasks whose output or inout clauses match data-reference-list. For example, in the code shown in Fig. 8, the programmer needs to insert the taskwait pragma in order to ensure that the next statement reads the appropriate value of variable x, which is generated by task A. However, tasks B and C can run in parallel with the code after the taskwait pragma.

Fig. 8 Example using the extended taskwait pragma
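A small sketch in the spirit of Fig. 8 (the compute_* functions are hypothetical): the on clause waits only for the producer of x, so the tasks producing y and z may overlap with the code that follows.

#include <stdio.h>

int x, y, z;

int compute_x(void);               /* hypothetical producers */
int compute_y(void);
int compute_z(void);

void taskwait_on_example(void)
{
    #pragma omp task output(x)     /* task A */
    x = compute_x();

    #pragma omp task output(y)     /* task B */
    y = compute_y();

    #pragma omp task output(z)     /* task C */
    z = compute_z();

    #pragma omp taskwait on(x)     /* wait only for task A */
    printf("x = %d\n", x);         /* B and C may still be running here */

    #pragma omp taskwait           /* full synchronization before returning */
}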

3 Extensions to the OpenMP Tasking Execution Model

The runtime supporting the execution of the StarSs extensions dynamically creates explicit tasks, and the memory region specifiers in each data-reference-list are used to build a task dependence graph as follows:

- Data references specified in input or inout clauses are checked against the data references specified in the output or inout clauses of all tasks, in the scope of the same parent task, that are in execution or pending execution. If there is a match, a true dependence is added between both tasks.
- Data references specified in output or inout clauses are checked against the data references specified in the input, output or inout clauses of all tasks, in the scope of the same parent task, that are in execution or pending execution. If there is a match, a false dependence appears. This dependence can be eliminated by dynamically renaming the memory region specified in the data reference. Renaming is an optional feature of the runtime which can be activated selectively to increase the potential parallelism in the task graph.
- A variable in a shared data clause, but not in an input, output or inout clause, indicates that the variable is accessed inside the task but is not affected by any data dependence in the current scope of execution (or is protected by some other mechanism).

When a task is ready for execution (i.e., it has no unsatisfied dependencies on previously generated tasks):

- The runtime can choose among the different available targets to execute the task. This decision is implementation-dependent, but it will ideally be tailored to resource availability. If no resource is available, the runtime could stall that task until one becomes available or launch the execution of the original function on the default smp device.
- The runtime system must copy the variables in the copy_in list from the host memory to the device memory. Once the task finishes execution, the runtime must copy back the variables in the copy_out list, if necessary.

Many optimizations are possible in the way local stores are handled, based on the potentially huge level of lookahead that a task graph represents and on the information available about the accessed data. It is, for example, possible to minimize data transfers by caching data already transferred and scheduling tasks to the accelerators where local copies of their input data are available. The execution of a taskwait forces the write-back to memory, for the references in its data-reference-list, of any data renaming that has been dynamically introduced by the runtime system to eliminate false dependences.

3.1 Proof-of-Concept Implementation for Multi-Core

The SMP Superscalar (SMPSs) runtime targets homogeneous multicore and shared-memory (SMP) machines. For this specific implementation the runtime exploits the fact that all cores have coherent load/store access to all data structures, and therefore no specific data movement between the main program and the cores executing the tasks is required. Although a main thread is responsible for executing the main program, for adding tasks to the data dependence graph, and for synchronizing all threads, the runtime implements distributed scheduling. All threads execute tasks, including the main thread when it would otherwise be idle. A general ready-task list is accessible by all threads, while private ready lists store the ready tasks that must be executed by each core. To favor the exploitation of memory locality, each thread first processes the tasks inserted in its own ready list, where it also inserts the new ready tasks released by the completion of those executed on the same core. To favor load balancing, a thread steals tasks from the main ready list and from the lists of other cores when its own is empty.
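The scheduling policy just described can be summarized with the following sketch (this is not the SMPSs source; list sizes, locking and data structures are assumptions): each worker drains its own ready list first, then the global list, and finally steals from other workers.

#include <pthread.h>
#include <stddef.h>

typedef struct task task_t;                  /* opaque task descriptor (assumed) */

#define MAX_READY   1024
#define MAX_WORKERS 64

typedef struct {
    task_t         *items[MAX_READY];
    int             count;
    pthread_mutex_t lock;
} ready_list_t;

static ready_list_t global_ready;
static ready_list_t local_ready[MAX_WORKERS];
static int          nworkers;

void scheduler_init(int n)
{
    nworkers = n;
    pthread_mutex_init(&global_ready.lock, NULL);
    for (int i = 0; i < n; i++)
        pthread_mutex_init(&local_ready[i].lock, NULL);
}

static task_t *pop(ready_list_t *l)
{
    task_t *t = NULL;
    pthread_mutex_lock(&l->lock);
    if (l->count > 0)
        t = l->items[--l->count];
    pthread_mutex_unlock(&l->lock);
    return t;
}

/* Locality first, then the shared list, then stealing. */
task_t *next_task(int worker)
{
    task_t *t;
    if ((t = pop(&local_ready[worker])) != NULL) return t;
    if ((t = pop(&global_ready)) != NULL)        return t;
    for (int v = 0; v < nworkers; v++)
        if (v != worker && (t = pop(&local_ready[v])) != NULL)
            return t;                            /* stolen task */
    return NULL;                                 /* nothing ready */
}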

3.2 Proof-of-Concept Implementation for the Cell B.E.

The Cell Superscalar (CellSs) runtime implementation targets execution on Cell B.E. based platforms. The challenges with this platform are twofold: the heterogeneity of the chip, with one general-purpose (and slower) multi-threaded Power-based processor element (PPE) and eight synergistic processor elements (SPEs); and the memory organization, with a PPE main memory that is not coherent with the local memories of the SPEs. Data transfers from the main memory to the small (only 256 KB) local memories of the SPEs must be explicitly programmed with DMA.

The CellSs runtime is organized as two threads that run on the PPE (the main thread and the helper thread) and up to sixteen threads that run on the SPEs.(1) The user program starts normal execution in the main thread and, whenever an annotated task is called, a new node is added to the task graph with its corresponding dependences. The helper thread is responsible for task scheduling and synchronization with the SPEs. Each SPE thread waits for tasks to be scheduled on its core. To reduce communication overhead and the need for explicit communications, tasks are scheduled in sets to be executed on the same SPE. Within a task set, double buffering (data transfers for the next task are performed concurrently with the execution of the current task) and other optimizations (e.g., the reduction of write transfers when a parameter is written more than once in a task set) can be applied. Extracting data-level parallelism for each SPE (simdization) is left in the hands of the native target compiler and/or the programmer (using intrinsics, for example).

(1) Although Cell B.E. chips have up to eight SPEs (and only six are available on the chips that equip the PlayStation 3), the blades usually come with two of those chips and the system enables access to all of them from the Power processors.

3.3 Proof-of-Concept Implementation for NVIDIA GPUs

The GPU Superscalar (GPUSs) implementation targets the parallelization of applications on platforms consisting of a general-purpose (possibly multi-core) processor (the host) connected to several hardware accelerators (the devices). In our prototype implementation, these accelerators are programmable NVIDIA GPUs, which communicate with the host via a PCIExpress bus. Each GPU can only access its own local memory space; direct communication between GPUs is not possible, so data transfers between them must be performed through the host memory. Our approach considers each accelerator as a single execution unit, which can efficiently execute specialized pieces of code (in our case, CUDA kernels defined by the programmer as tasks). GPUSs is not aware of the internal architecture of the accelerator, and it only exploits task parallelism by mapping/scheduling the execution of tasks to the hardware accelerators in the system. As in CellSs, extracting data-level parallelism inside a single GPU is left in the hands of the programmer and the native device-specific compiler.

The architecture of the GPUSs runtime is similar to that of CellSs. However, there are some important differences between the two, derived from the architectural features of each system. In particular, data transfers between memory spaces through the PCIExpress bus are a major bottleneck in this type of multi-accelerator system. Therefore, the number of data movements associated with the execution of a given task must be reduced as much as possible to improve performance. To do so, GPUSs views the local store of each accelerator as a cache memory that keeps data blocks recently used by the GPU.

The replacement policy (LRU in our current implementation) and the number and size of the blocks in the cache can be easily tuned in the runtime. Inconsistencies between data blocks stored in the caches of the accelerators and the blocks in the host memory are allowed by using a write-back memory coherence policy; thus, data blocks written by a GPU are only updated in the host memory when another GPU has to operate on them. Coherence among the local stores of the GPUs is maintained with a write-invalidate policy. The system handles the existence of multiple memory spaces by keeping a memory map of the cache of each accelerator. The translation of addresses between different memory spaces is transparent to the user. Additionally, the information stored in the map is used to reduce data transfers by carefully mapping tasks to the most appropriate execution resource. The basic architecture of the runtime and many of the techniques implemented in GPUSs for NVIDIA multi-accelerator systems can be easily ported to other (homogeneous or heterogeneous) multi-accelerator platforms with similar architectures.

3.4 Proof-of-Concept Implementation for SGI RASC FPGA

For FPGA-based systems, offloading a task means sending the bitstream corresponding to the task, configuring the FPGA device, and sending/receiving data. In these systems, the time required to offload a bitstream to the device (bitstream file read and device configuration) can be significantly high. For example, it takes approximately one second to load a 4 MB bitstream in the SGI RASC blade. Once the bitstream is loaded in memory, reusing it takes just 4 ms. In order to hide this high bitstream loading time, the FPGASs prototype implementation includes a fully associative bitstream cache that keeps information about the bitstreams currently loaded in the FPGA devices. When a task pragma is found, the runtime checks whether the bitstream that implements the task is already configured. A hit in the bitstream cache produces the effective offloading of the task execution. If the runtime detects a miss in the bitstream cache, it applies a least frequently used (LFU) replacement policy. During the miss, and in order to hide the FPGA configuration time, the runtime launches the execution of the task on the host processor. Once the bitstream is configured, the runtime will detect a hit in the bitstream cache and offload the execution of future instances of that task to the FPGA device. Since the data transfer overhead between the host and the FPGA device is usually an important issue, the runtime system applies data packing and unpacking according to the different memory associations detected by the compiler.
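The bitstream-cache policy described for the FPGASs prototype can be sketched as follows (a simplification under assumptions: the cache size, the runtime hooks and the sequential handling of a miss are all illustrative, and in the real runtime the host execution and the FPGA configuration would overlap):

#include <stdbool.h>

#define CACHE_SLOTS 8                         /* assumed number of resident bitstreams */

typedef struct {
    int           task_id;                    /* task type implemented by the bitstream */
    unsigned long use_count;                  /* statistics for LFU replacement         */
    bool          valid;
} bitstream_slot_t;

static bitstream_slot_t cache[CACHE_SLOTS];

/* Hypothetical hooks into the runtime and the device library. */
void load_bitstream(int task_id, int slot);   /* slow: file read + FPGA configuration */
void run_task_on_fpga(int task_id, void *args);
void run_task_on_host(int task_id, void *args);  /* original function on the smp device */

void offload_or_fallback(int task_id, void *args)
{
    int victim = 0;
    for (int s = 0; s < CACHE_SLOTS; s++) {
        if (cache[s].valid && cache[s].task_id == task_id) {
            cache[s].use_count++;             /* hit: offload to the FPGA */
            run_task_on_fpga(task_id, args);
            return;
        }
        if (!cache[s].valid ||
            (cache[victim].valid &&
             cache[s].use_count < cache[victim].use_count))
            victim = s;                       /* track the least frequently used slot */
    }
    /* Miss: execute this instance on the host to hide the configuration
       time, and configure the bitstream so future instances hit. */
    run_task_on_host(task_id, args);
    load_bitstream(task_id, victim);
    cache[victim].task_id   = task_id;
    cache[victim].use_count = 1;
    cache[victim].valid     = true;
}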

3.5 Bringing it all Together: Challenges Ahead

The previous subsections documented implementations of the model for specific accelerators. These implementations (other than SMPSs) use the main processor mainly as a controlling device to spawn tasks to the accelerators, but do not execute tasks on it. With new architectures [4], where a fair amount of the processing capacity resides in the main processors, this is not a desirable approach. Furthermore, architectures with more than one type of accelerator are already appearing, and a runtime that can handle more than one at the same time will be required.

Supporting more than one accelerator and being able to run tasks on the main processor(s) is not that difficult. By defining clean interfaces that hide architectural details, this becomes more an engineering task than a research one. But then things become more interesting. Once a task can run on any of the processing elements (supposing the user has specified versions for all of them with our extensions), the runtime must decide which one to use. On one hand, the runtime should maximize the usage of all the processing elements of the system. A naive approach could schedule each new task on the fastest element that is not executing anything. But, of course, different tasks will be faster on different processing elements. Because there is usually some repetitiveness in the kind of tasks executed, we think that it will be possible for the runtime to learn, as the execution proceeds, where to send each task. On the other hand, we want to minimize the communication costs. One obvious thing to do is to schedule tasks that use the same data on the same processing element. This will not always be possible, in which case the runtime will need to take into account the cost of transferring the data from the processing element where they were last used. This is not an easy task, as the communication mechanism may differ from one processing element to another. These two factors (efficiency and reduced communication) will need to be balanced to obtain an optimal scheduling. An important factor in this decision is the communication-to-computation ratio, and it is unclear whether the runtime will be able to find the right choice on its own. In any case, further research needs to be conducted on a fully heterogeneous runtime/system to solve this tradeoff.

4 Experimental Results

In this section we illustrate the general validity and portability, and the promising performance results, of the StarSs model, and therefore of the proposed extensions to OpenMP, to exploit task-level parallelism in heterogeneous architectures. The Cholesky factorization in Fig. 4 will serve as a case study for SMPSs, CellSs and GPUSs, but we emphasize again that the scope of StarSs is not limited to the parallelization of (dense) linear algebra codes. This operation is appealing in that it is a basic building block for a wide variety of scientific and engineering applications (the Cholesky factorization is the most efficient method to solve certain classes of linear systems of equations). Besides, this factorization shares with many other linear algebra operations a well-known and regular algorithmic structure, and it has traditionally been considered a starting point for HPC community efforts. In all experiments, when reporting the rate of computation we consider the cost of the Cholesky factorization of a square matrix of order n to be the standard n^3/3 floating-point arithmetic operations (flops for short).

For SMPSs, the annotations with the target construct in the code in Fig. 4 are ignored, as all operations are computed on the cores of the shared-memory multiprocessor. The experimental analysis has been performed on an SGI Altix multiprocessor, consisting of 16 nodes equipped with a dual-core CPU at 1.6 GHz each (the peak performance of the system is GFLOPS). The memory system presents a cc-NUMA organization, with a local RAM shared by the CPUs in each node, and an SGI NUMAlink interconnection network. The codes employ double-precision arithmetic and were linked with the BLAS in Intel's Math Kernel Library (MKL) 9.1.

Fig. 9 Performance (left) and scalability (right) of the SMPSs Cholesky factorization routine: GFLOPS on 32 cores as a function of the matrix size (left), and GFLOPS for an 8000 x 8000 matrix as a function of the number of cores (right), for SMPSs and MKL

This package provides a highly tuned implementation of linear algebra basic building blocks, like chol_trsm_right, chol_gemm, and chol_syrk, on Intel processors. Figure 9 compares the GFLOPS rates attained by the Cholesky factorization routine parallelized with SMPSs and the (traditional) multi-threaded implementation of this routine in MKL. We note the superior performance of the SMPSs code, due to the higher level of parallelism and scalability exposed by the dynamic scheduling of tasks in SMPSs.

The experiments with CellSs were run on a QS22 IBM blade server, with 2 PowerXCell processors (the high-performance double-precision floating-point version of the Cell B.E. processor) at 3.2 GHz, and 12 GB of memory. The results were obtained with CellSs version 2.2 and the IBM SDK version 3.1. Figure 10 presents the results obtained for the Cholesky factorization on the Cell-based platform. The figure on the left shows the absolute performance results when executing with 8 SPUs and varying the matrix size. The results are compared against a hand-coded version, where the graph generation and the scheduling of the tasks are performed statically [5]. The loss in performance against this hand-coded version is 19% for large matrix sizes. We consider this a more than reasonable result, taking into account that the original hand-coded version has 302 lines of code while the CellSs version has 32.(2) The figure on the right shows the scalability of CellSs from 1 to 8 SPUs, using the elapsed time with 1 SPU as the base case.

(2) Only accounting for code lines, without comments and includes. The source code of the tiles that is the same for both examples is not accounted for.

Our next experiment employs GPUSs, the prototype extension of StarSs for platforms with multiple GPUs. The target system is a server with two Intel Xeon QuadCore E5405 (2.0 GHz) processors and 8 GBytes of shared DDR2 RAM memory, connected to an NVIDIA Tesla s870 computing system with four NVIDIA G80 GPUs and 6 GBytes of DDR3 memory (1.5 GBytes per GPU). The Intel 5400 chipset features two PCIExpress Gen2 interfaces connected to the Tesla, which deliver a peak bandwidth of 48 Gbits/second on each interface. We used NVIDIA CUBLAS (version 2.0) built on top of the CUDA API (version 2.0) together with the NVIDIA driver (171.05). Single precision was employed in all experiments. The cost of all data transfers between RAM and GPU memories is included in the timings. Figure 11 illustrates the parallel performance and scalability of the Cholesky factorization codes with our prototype implementation of GPUSs, compared with the original CUBLAS.

Fig. 10 Performance (left) and scalability (right) of the Cholesky factorization routine parallelized with CellSs: GFLOPS on 8 SPUs as a function of the matrix size, for CellSs and a hand-coded version with static scheduling (left), and speed-up for a 4096 x 4096 matrix as a function of the number of SPUs (right)

Fig. 11 Performance (left) and scalability (right) of the Cholesky factorization routine parallelized with GPUSs: GFLOPS on 4 GPUs as a function of the matrix size, for GPUSs and CUBLAS (left), and speed-up with 1, 2, 3 and 4 GPUs as a function of the matrix size (right)

5 Related Work

OpenMP [2] grew out of the need to standardize the explosion of parallel programming languages and tools of the 90s. It was initially structured around parallel loops and was meant to handle dense numerical applications. The simplicity of its original interface, the use of a shared-memory model, and the fact that the parallelism of a program is expressed with annotations loosely coupled to the code have all helped OpenMP become well accepted. While recent additions to OpenMP 3.0 [6] accommodate task parallelism, making it more suitable for irregular codes, OpenMP is not ready for the challenges posed by the new generation of multi-core heterogeneous architectures.

The Cell B.E. [7] is likely one of the most challenging heterogeneous architectures to program. IBM developed an OpenMP prototype compiler that generates parallel programs under the master-slave programming model. Data transfers between the master (PPE) and the slaves (SPEs) are transparently introduced employing a software cache, although the compiler can try to optimize for very regular access patterns.

Other programming solutions for the Cell B.E., like Sequoia, MPI microtasks, and our own CellSs, are more promising in that they target task parallelism, employing higher-level information to perform more complete optimizations.

GPUs have traditionally been programmed using domain-specific graphics libraries such as OpenGL or DirectX. NVIDIA was one of the pioneering graphics companies to realize the potential of general-purpose graphics processors, and the benefits which could be gained by offering a general-purpose application programming interface (API). The result was CUDA [8], a unified architecture design featuring a programmable graphics pipeline, and an API to develop parallel programs that exploit data parallelism on this architecture. Unfortunately, in order to develop efficient codes, CUDA programmers (as well as those of Brook+ [9], the data-parallel language for AMD/ATI GPUs) still need to be deeply aware of the underlying architecture. A promising approach that may solve this problem consists in automatically transforming existing OpenMP codes into CUDA.

Hardware description languages (e.g., Verilog or VHDL) are employed to develop FPGA-accelerated computational kernels. In order to solve the programmability problem for these devices, extensions to the C language have appeared. Usually, there are two strategies to compile these extensions. In the first strategy, the section of code to be offloaded to the FPGA is translated from C to VHDL. The second strategy is to map a soft processor (e.g., Xilinx MicroBlaze) onto the FPGA, and translate the source code to be executed into code that this soft processor understands. In both cases, a host/device communication library, such as RASClib for the SGI RASC architecture, is necessary to offload the execution to one of the available FPGAs, including data movement.

The PGI [10] and HMPP [11] programming models are two other approaches trying to tackle the accelerator problem with high-level directives. PGI uses compiler technology to offload the execution of loops to the accelerators. HMPP also annotates functions as tasks to be offloaded to the accelerators. We think that StarSs has higher potential in that it shifts part of the intelligence that HMPP and PGI delegate to the compiler to the StarSs runtime system. Although these alternatives do support a fair amount of asynchronous computations expressed as futures or continuations, the level of lookahead they support is limited in practice. In these approaches, synchronization requests (waiting for a given future or selecting among some list of them when the result is needed) have to be explicitly inserted in the main control flow of the program. Besides the additional complexity of the code, this approach implies that certain scheduling decisions are made statically and hardwired in the application code. The approach followed in StarSs exploits much higher levels of lookahead (tens of thousands of tasks) without requiring the programmer to schedule the synchronization operations explicitly, giving the runtime much higher flexibility to respond to the foreseeable variability of application characteristics and resource availability.

5.1 Tending a Bridge to OpenCL

A recent attempt to unify the programming models for general-purpose multi-core architectures and the different types of hardware accelerators (Cell B.E., GPUs, FPGAs, DSPs, etc.) is OpenCL [12].

The participation of silicon vendors (e.g., Intel, IBM, NVIDIA, and AMD) in the definition of this open standard ensures portability, low-level access to the hardware, and supposedly high performance. We believe, however, that OpenCL still exposes much of the low-level details, making it cumbersome to use by non-experts. Finding the appropriate interoperability between the StarSs extensions to OpenMP and OpenCL can be a viable solution. While StarSs targets the exploitation of task parallelism by mapping/scheduling the execution of tasks to the hardware accelerators in the system, OpenCL could be used to express, in a portable way, the data-level parallelism to be exploited in the accelerators by the native device-specific compiler.

6 Conclusions and Future Work

In this paper we propose a number of extensions to the OpenMP language, coming from the StarSs research programming model, that try to tackle the programmability of emerging heterogeneous architectures. The first group of these extensions aims to enable a more productive and effective parallelization, where the user specifies the data dependences among the different parallel tasks of the computation. The compiler and runtime are then responsible for extracting the parallelism of the application and performing the needed synchronizations. The second group specifies on which accelerators a task should run. This information also allows the runtime to generate the code to offload the tasks to the accelerators and, combined with the data dependences, to generate the code that takes care of the data movement. Furthermore, a user can specify optimized versions of the task code for a given accelerator to take advantage of already optimized operations (e.g., the BLAS libraries) while maintaining a single source code that is portable across multiple platforms.

We have presented the challenges of implementing this model on a number of architectures (e.g., multicores, the Cell B.E., GPUs and FPGAs). Our experimental results show that our current prototypes can obtain high performance compared to the amount of effort that the programmer needs to devote. While in some cases there is still a gap with respect to hand-tuned versions, we expect that further research will reduce it. In particular, our current research focuses on improving the scheduling of tasks to minimize the communication of data across the different processing elements of the system and to increase data locality. Other research directions try to increase the amount of available parallelism by removing false dependences (using renaming as outlined in Sect. 3, and establishing a trade-off between the memory used in data renaming and the parallelism exploited at runtime). Finally, we are in the process of integrating all proof-of-concept implementations into a single runtime and including the intelligence described in Sect. 3.5 to make use of different accelerators at the same time. Further scheduling research is needed to make this efficient.

Acknowledgments We would like to thank the reviewers of this paper for their helpful and constructive comments. The researchers at BSC-UPC were supported by the Spanish Ministry of Science and Innovation (contracts no. TIN and CSD ), the European Commission in the context of the SARC project (contract no. ) and the HiPEAC2 Network of Excellence (contract no. FP7/IST ), and the MareIncognito project under the BSC-IBM collaboration agreement.

The researchers at the Universidad Jaume I were supported by the Spanish Ministry of Science and Innovation/FEDER (contracts no. TIN C02-02 and TIN C04-01) and by the Fundación Caixa-Castellón/Bancaixa (contracts no. P1B and P1B ).

References

1. Seiler, L., Carmean, D., Sprangle, E., Forsyth, T., Abrash, M., Dubey, P., Junkins, S., Lake, A., Sugerman, J., Cavin, R., Espasa, R., Grochowski, E., Juan, T., Hanrahan, P.: Larrabee: a many-core x86 architecture for visual computing. ACM Trans. Graph. 27(3), 1-15 (2008)
2. OpenMP Architecture Review Board: OpenMP 3.0 Specification. May (2008)
3. Bellens, P., Perez, J.M., Badia, R.M., Labarta, J.: CellSs: a programming model for the Cell BE architecture. In: Proceedings of the ACM/IEEE SC 2006 Conference, November (2006)
4. Turner, J.A.: Roadrunner: Heterogeneous Petascale Computing for Predictive Simulation. Technical Report LANL-UR, Los Alamos National Lab, Las Vegas, NV (2007)
5. Kurzak, J., Buttari, A., Dongarra, J.: Solving systems of linear equations on the Cell processor using Cholesky factorization. IEEE Trans. Parallel Distrib. Syst. 19(9) (2008)
6. Ayguadé, E., Copty, N., Duran, A., Hoeflinger, J., Lin, Y., Massaioli, F., Teruel, X., Unnikrishnan, P., Zhang, G.: The design of OpenMP tasks. IEEE Trans. Parallel Distrib. Syst. 20(3) (2009)
7. Pham, D.C., Aipperspach, T., Boerstler, D., Bolliger, M., Chaudhry, R., Cox, D., Harvey, P., Harvey, P.M., Hofstee, H.P., Johns, C., Kahle, J., Kameyama, A., Keaty, J., Masubuchi, Y., Pham, M., Pille, J., Posluszny, S., Riley, M., Stasiak, D.L., Suzuoki, M., Takahashi, O., Warnock, J., Weitzel, S., Wendel, D., Yazawa, K.: Overview of the architecture, circuit design, and physical implementation of a first-generation Cell processor. IEEE J. Solid-State Circuits 41(1) (2006)
8. NVIDIA: NVIDIA CUDA Compute Unified Device Architecture-Programming Guide (2007)
9. Buck, I., Foley, T., Horn, D., Sugerman, J., Fatahalian, K., Houston, M., Hanrahan, P.: Brook for GPUs: stream computing on graphics hardware. In: SIGGRAPH '04: ACM SIGGRAPH 2004 Papers. ACM Press, New York (2004)
10. PGI: PGI Fortran and C Accelerator Compilers and Programming Model Technology Preview. The Portland Group (2008)
11. Dolbeau, R., Bihan, S., Bodin, F.: HMPP: a hybrid multi-core parallel programming environment. In: First Workshop on General Purpose Processing on Graphics Processing Units, October (2007)
12. Khronos OpenCL Working Group: The OpenCL Specification. Aaftab Munshi, Ed. (2009)


More information

OpenACC Standard. Credits 19/07/ OpenACC, Directives for Accelerators, Nvidia Slideware

OpenACC Standard. Credits 19/07/ OpenACC, Directives for Accelerators, Nvidia Slideware OpenACC Standard Directives for Accelerators Credits http://www.openacc.org/ o V1.0: November 2011 Specification OpenACC, Directives for Accelerators, Nvidia Slideware CAPS OpenACC Compiler, HMPP Workbench

More information

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI. CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance

More information

Shared Memory Parallel Programming. Shared Memory Systems Introduction to OpenMP

Shared Memory Parallel Programming. Shared Memory Systems Introduction to OpenMP Shared Memory Parallel Programming Shared Memory Systems Introduction to OpenMP Parallel Architectures Distributed Memory Machine (DMP) Shared Memory Machine (SMP) DMP Multicomputer Architecture SMP Multiprocessor

More information

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620 Introduction to Parallel and Distributed Computing Linh B. Ngo CPSC 3620 Overview: What is Parallel Computing To be run using multiple processors A problem is broken into discrete parts that can be solved

More information

OpenMP tasking model for Ada: safety and correctness

OpenMP tasking model for Ada: safety and correctness www.bsc.es www.cister.isep.ipp.pt OpenMP tasking model for Ada: safety and correctness Sara Royuela, Xavier Martorell, Eduardo Quiñones and Luis Miguel Pinho Vienna (Austria) June 12-16, 2017 Parallel

More information

Tutorial OmpSs: Overlapping communication and computation

Tutorial OmpSs: Overlapping communication and computation www.bsc.es Tutorial OmpSs: Overlapping communication and computation PATC course Parallel Programming Workshop Rosa M Badia, Xavier Martorell PATC 2013, 18 October 2013 Tutorial OmpSs Agenda 10:00 11:00

More information

Dealing with Asymmetry for Performance and Energy Efficiency

Dealing with Asymmetry for Performance and Energy Efficiency Dealing with Asymmetryfor Performance and Energy Efficiency Enrique S. QUINTANA-ORTÍ Motivation Moore s law is alive, but Dennard s scaling is over Motivation Welcome dark silicon and asymmetric architectures

More information

Task Superscalar: Using Processors as Functional Units

Task Superscalar: Using Processors as Functional Units Task Superscalar: Using Processors as Functional Units Yoav Etsion Alex Ramirez Rosa M. Badia Eduard Ayguade Jesus Labarta Mateo Valero HotPar, June 2010 Yoav Etsion Senior Researcher Parallel Programming

More information

IBM Cell Processor. Gilbert Hendry Mark Kretschmann

IBM Cell Processor. Gilbert Hendry Mark Kretschmann IBM Cell Processor Gilbert Hendry Mark Kretschmann Architectural components Architectural security Programming Models Compiler Applications Performance Power and Cost Conclusion Outline Cell Architecture:

More information

Performance potential for simulating spin models on GPU

Performance potential for simulating spin models on GPU Performance potential for simulating spin models on GPU Martin Weigel Institut für Physik, Johannes-Gutenberg-Universität Mainz, Germany 11th International NTZ-Workshop on New Developments in Computational

More information

CellSs: a Programming Model for the Cell BE Architecture

CellSs: a Programming Model for the Cell BE Architecture CellSs: a Programming Model for the Cell BE Architecture Pieter Bellens, Josep M. Perez, Rosa M. Badia and Jesus Labarta Barcelona Supercomputing Center and UPC Building Nexus II, Jordi Girona 29, Barcelona

More information

OmpSs Fundamentals. ISC 2017: OpenSuCo. Xavier Teruel

OmpSs Fundamentals. ISC 2017: OpenSuCo. Xavier Teruel OmpSs Fundamentals ISC 2017: OpenSuCo Xavier Teruel Outline OmpSs brief introduction OmpSs overview and influence in OpenMP Execution model and parallelization approaches Memory model and target copies

More information

OpenMPSuperscalar: Task-Parallel Simulation and Visualization of Crowds with Several CPUs and GPUs

OpenMPSuperscalar: Task-Parallel Simulation and Visualization of Crowds with Several CPUs and GPUs www.bsc.es OpenMPSuperscalar: Task-Parallel Simulation and Visualization of Crowds with Several CPUs and GPUs Hugo Pérez UPC-BSC Benjamin Hernandez Oak Ridge National Lab Isaac Rudomin BSC March 2015 OUTLINE

More information

COSC 6385 Computer Architecture. - Multi-Processors (V) The Intel Larrabee, Nvidia GT200 and Fermi processors

COSC 6385 Computer Architecture. - Multi-Processors (V) The Intel Larrabee, Nvidia GT200 and Fermi processors COSC 6385 Computer Architecture - Multi-Processors (V) The Intel Larrabee, Nvidia GT200 and Fermi processors Fall 2012 References Intel Larrabee: [1] L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M.

More information

Performance Analysis of BLAS Libraries in SuperLU_DIST for SuperLU_MCDT (Multi Core Distributed) Development

Performance Analysis of BLAS Libraries in SuperLU_DIST for SuperLU_MCDT (Multi Core Distributed) Development Available online at www.prace-ri.eu Partnership for Advanced Computing in Europe Performance Analysis of BLAS Libraries in SuperLU_DIST for SuperLU_MCDT (Multi Core Distributed) Development M. Serdar Celebi

More information

Distributed Dense Linear Algebra on Heterogeneous Architectures. George Bosilca

Distributed Dense Linear Algebra on Heterogeneous Architectures. George Bosilca Distributed Dense Linear Algebra on Heterogeneous Architectures George Bosilca bosilca@eecs.utk.edu Centraro, Italy June 2010 Factors that Necessitate to Redesign of Our Software» Steepness of the ascent

More information

Arquitecturas y Modelos de. Multicore

Arquitecturas y Modelos de. Multicore Arquitecturas y Modelos de rogramacion para Multicore 17 Septiembre 2008 Castellón Eduard Ayguadé Alex Ramírez Opening statements * Some visionaries already predicted multicores 30 years ago And they have

More information

high performance medical reconstruction using stream programming paradigms

high performance medical reconstruction using stream programming paradigms high performance medical reconstruction using stream programming paradigms This Paper describes the implementation and results of CT reconstruction using Filtered Back Projection on various stream programming

More information

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host

More information

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant

More information

OpenACC 2.6 Proposed Features

OpenACC 2.6 Proposed Features OpenACC 2.6 Proposed Features OpenACC.org June, 2017 1 Introduction This document summarizes features and changes being proposed for the next version of the OpenACC Application Programming Interface, tentatively

More information

The Stampede is Coming: A New Petascale Resource for the Open Science Community

The Stampede is Coming: A New Petascale Resource for the Open Science Community The Stampede is Coming: A New Petascale Resource for the Open Science Community Jay Boisseau Texas Advanced Computing Center boisseau@tacc.utexas.edu Stampede: Solicitation US National Science Foundation

More information

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013 Lecture 13: Memory Consistency + a Course-So-Far Review Parallel Computer Architecture and Programming Today: what you should know Understand the motivation for relaxed consistency models Understand the

More information

OpenCL TM & OpenMP Offload on Sitara TM AM57x Processors

OpenCL TM & OpenMP Offload on Sitara TM AM57x Processors OpenCL TM & OpenMP Offload on Sitara TM AM57x Processors 1 Agenda OpenCL Overview of Platform, Execution and Memory models Mapping these models to AM57x Overview of OpenMP Offload Model Compare and contrast

More information

Strategies for Parallelizing the Solution of Rational Matrix Equations

Strategies for Parallelizing the Solution of Rational Matrix Equations Strategies for Parallelizing the Solution of Rational Matrix Equations José M. Badía 1, Peter Benner, Maribel Castillo 1, Heike Faßbender 3, Rafael Mayo 1, Enrique S. Quintana-Ortí 1, and Gregorio Quintana-Ortí

More information

Higher Level Programming Abstractions for FPGAs using OpenCL

Higher Level Programming Abstractions for FPGAs using OpenCL Higher Level Programming Abstractions for FPGAs using OpenCL Desh Singh Supervising Principal Engineer Altera Corporation Toronto Technology Center ! Technology scaling favors programmability CPUs."#/0$*12'$-*

More information

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism Motivation for Parallelism Motivation for Parallelism The speed of an application is determined by more than just processor speed. speed Disk speed Network speed... Multiprocessors typically improve the

More information

Parallel Computing Using OpenMP/MPI. Presented by - Jyotsna 29/01/2008

Parallel Computing Using OpenMP/MPI. Presented by - Jyotsna 29/01/2008 Parallel Computing Using OpenMP/MPI Presented by - Jyotsna 29/01/2008 Serial Computing Serially solving a problem Parallel Computing Parallelly solving a problem Parallel Computer Memory Architecture Shared

More information

6.1 Multiprocessor Computing Environment

6.1 Multiprocessor Computing Environment 6 Parallel Computing 6.1 Multiprocessor Computing Environment The high-performance computing environment used in this book for optimization of very large building structures is the Origin 2000 multiprocessor,

More information

Computing architectures Part 2 TMA4280 Introduction to Supercomputing

Computing architectures Part 2 TMA4280 Introduction to Supercomputing Computing architectures Part 2 TMA4280 Introduction to Supercomputing NTNU, IMF January 16. 2017 1 Supercomputing What is the motivation for Supercomputing? Solve complex problems fast and accurately:

More information

High performance matrix inversion of SPD matrices on graphics processors

High performance matrix inversion of SPD matrices on graphics processors High performance matrix inversion of SPD matrices on graphics processors Peter Benner, Pablo Ezzatti, Enrique S. Quintana-Ortí and Alfredo Remón Max-Planck-Institute for Dynamics of Complex Technical Systems

More information

PGI Accelerator Programming Model for Fortran & C

PGI Accelerator Programming Model for Fortran & C PGI Accelerator Programming Model for Fortran & C The Portland Group Published: v1.3 November 2010 Contents 1. Introduction... 5 1.1 Scope... 5 1.2 Glossary... 5 1.3 Execution Model... 7 1.4 Memory Model...

More information

Chapter 3 Parallel Software

Chapter 3 Parallel Software Chapter 3 Parallel Software Part I. Preliminaries Chapter 1. What Is Parallel Computing? Chapter 2. Parallel Hardware Chapter 3. Parallel Software Chapter 4. Parallel Applications Chapter 5. Supercomputers

More information

Solving Dense Linear Systems on Graphics Processors

Solving Dense Linear Systems on Graphics Processors Informe Técnico ICC 2-2-28 Solving Dense Linear Systems on Graphics Processors Sergio Barrachina, Maribel Castillo, Francisco D. Igual, Rafael Mayo, Enrique S. Quintana-Ortí Febrero de 28 Departamento

More information

Introduction to CELL B.E. and GPU Programming. Agenda

Introduction to CELL B.E. and GPU Programming. Agenda Introduction to CELL B.E. and GPU Programming Department of Electrical & Computer Engineering Rutgers University Agenda Background CELL B.E. Architecture Overview CELL B.E. Programming Environment GPU

More information

Productive Performance on the Cray XK System Using OpenACC Compilers and Tools

Productive Performance on the Cray XK System Using OpenACC Compilers and Tools Productive Performance on the Cray XK System Using OpenACC Compilers and Tools Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc. 1 The New Generation of Supercomputers Hybrid

More information

Chapter 17 - Parallel Processing

Chapter 17 - Parallel Processing Chapter 17 - Parallel Processing Luis Tarrataca luis.tarrataca@gmail.com CEFET-RJ Luis Tarrataca Chapter 17 - Parallel Processing 1 / 71 Table of Contents I 1 Motivation 2 Parallel Processing Categories

More information

Architecture, Programming and Performance of MIC Phi Coprocessor

Architecture, Programming and Performance of MIC Phi Coprocessor Architecture, Programming and Performance of MIC Phi Coprocessor JanuszKowalik, Piotr Arłukowicz Professor (ret), The Boeing Company, Washington, USA Assistant professor, Faculty of Mathematics, Physics

More information

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of

More information

Massively Parallel Architectures

Massively Parallel Architectures Massively Parallel Architectures A Take on Cell Processor and GPU programming Joel Falcou - LRI joel.falcou@lri.fr Bat. 490 - Bureau 104 20 janvier 2009 Motivation The CELL processor Harder,Better,Faster,Stronger

More information

OpenMP for next generation heterogeneous clusters

OpenMP for next generation heterogeneous clusters OpenMP for next generation heterogeneous clusters Jens Breitbart Research Group Programming Languages / Methodologies, Universität Kassel, jbreitbart@uni-kassel.de Abstract The last years have seen great

More information

WHY PARALLEL PROCESSING? (CE-401)

WHY PARALLEL PROCESSING? (CE-401) PARALLEL PROCESSING (CE-401) COURSE INFORMATION 2 + 1 credits (60 marks theory, 40 marks lab) Labs introduced for second time in PP history of SSUET Theory marks breakup: Midterm Exam: 15 marks Assignment:

More information

Accelerated Library Framework for Hybrid-x86

Accelerated Library Framework for Hybrid-x86 Software Development Kit for Multicore Acceleration Version 3.0 Accelerated Library Framework for Hybrid-x86 Programmer s Guide and API Reference Version 1.0 DRAFT SC33-8406-00 Software Development Kit

More information

Tesla GPU Computing A Revolution in High Performance Computing

Tesla GPU Computing A Revolution in High Performance Computing Tesla GPU Computing A Revolution in High Performance Computing Mark Harris, NVIDIA Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction to Tesla CUDA Architecture Programming & Memory

More information

THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT HARDWARE PLATFORMS

THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT HARDWARE PLATFORMS Computer Science 14 (4) 2013 http://dx.doi.org/10.7494/csci.2013.14.4.679 Dominik Żurek Marcin Pietroń Maciej Wielgosz Kazimierz Wiatr THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT

More information

Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions

Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions Ziming Zhong Vladimir Rychkov Alexey Lastovetsky Heterogeneous Computing

More information

A Translation Framework for Automatic Translation of Annotated LLVM IR into OpenCL Kernel Function

A Translation Framework for Automatic Translation of Annotated LLVM IR into OpenCL Kernel Function A Translation Framework for Automatic Translation of Annotated LLVM IR into OpenCL Kernel Function Chen-Ting Chang, Yu-Sheng Chen, I-Wei Wu, and Jyh-Jiun Shann Dept. of Computer Science, National Chiao

More information

Evaluation of Asynchronous Offloading Capabilities of Accelerator Programming Models for Multiple Devices

Evaluation of Asynchronous Offloading Capabilities of Accelerator Programming Models for Multiple Devices Evaluation of Asynchronous Offloading Capabilities of Accelerator Programming Models for Multiple Devices Jonas Hahnfeld 1, Christian Terboven 1, James Price 2, Hans Joachim Pflug 1, Matthias S. Müller

More information

On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters

On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters 1 On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters N. P. Karunadasa & D. N. Ranasinghe University of Colombo School of Computing, Sri Lanka nishantha@opensource.lk, dnr@ucsc.cmb.ac.lk

More information

Big Data Analytics Performance for Large Out-Of- Core Matrix Solvers on Advanced Hybrid Architectures

Big Data Analytics Performance for Large Out-Of- Core Matrix Solvers on Advanced Hybrid Architectures Procedia Computer Science Volume 51, 2015, Pages 2774 2778 ICCS 2015 International Conference On Computational Science Big Data Analytics Performance for Large Out-Of- Core Matrix Solvers on Advanced Hybrid

More information

Evaluation of sparse LU factorization and triangular solution on multicore architectures. X. Sherry Li

Evaluation of sparse LU factorization and triangular solution on multicore architectures. X. Sherry Li Evaluation of sparse LU factorization and triangular solution on multicore architectures X. Sherry Li Lawrence Berkeley National Laboratory ParLab, April 29, 28 Acknowledgement: John Shalf, LBNL Rich Vuduc,

More information

Parallel Computing. Hwansoo Han (SKKU)

Parallel Computing. Hwansoo Han (SKKU) Parallel Computing Hwansoo Han (SKKU) Unicore Limitations Performance scaling stopped due to Power consumption Wire delay DRAM latency Limitation in ILP 10000 SPEC CINT2000 2 cores/chip Xeon 3.0GHz Core2duo

More information

Parallel and Distributed Computing

Parallel and Distributed Computing Parallel and Distributed Computing NUMA; OpenCL; MapReduce José Monteiro MSc in Information Systems and Computer Engineering DEA in Computational Engineering Department of Computer Science and Engineering

More information

A Case for Hardware Task Management Support for the StarSS Programming Model

A Case for Hardware Task Management Support for the StarSS Programming Model A Case for Hardware Task Management Support for the StarSS Programming Model Cor Meenderinck Delft University of Technology Delft, the Netherlands cor@ce.et.tudelft.nl Ben Juurlink Technische Universität

More information

Dealing with Heterogeneous Multicores

Dealing with Heterogeneous Multicores Dealing with Heterogeneous Multicores François Bodin INRIA-UIUC, June 12 th, 2009 Introduction Main stream applications will rely on new multicore / manycore architectures It is about performance not parallelism

More information

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors Computer and Information Sciences College / Computer Science Department CS 207 D Computer Architecture Lecture 9: Multiprocessors Challenges of Parallel Processing First challenge is % of program inherently

More information

Device Memories and Matrix Multiplication

Device Memories and Matrix Multiplication Device Memories and Matrix Multiplication 1 Device Memories global, constant, and shared memories CUDA variable type qualifiers 2 Matrix Multiplication an application of tiling runningmatrixmul in the

More information

Compilation for Heterogeneous Platforms

Compilation for Heterogeneous Platforms Compilation for Heterogeneous Platforms Grid in a Box and on a Chip Ken Kennedy Rice University http://www.cs.rice.edu/~ken/presentations/heterogeneous.pdf Senior Researchers Ken Kennedy John Mellor-Crummey

More information

computational power computational

computational power computational rcuda: rcuda: an an approach approach to to provide provide remote remote access access to to computational computational power power Rafael Mayo Gual Universitat Jaume I Spain (1 of 59) HPC Advisory Council

More information

A Uniform Programming Model for Petascale Computing

A Uniform Programming Model for Petascale Computing A Uniform Programming Model for Petascale Computing Barbara Chapman University of Houston WPSE 2009, Tsukuba March 25, 2009 High Performance Computing and Tools Group http://www.cs.uh.edu/~hpctools Agenda

More information

Overview: The OpenMP Programming Model

Overview: The OpenMP Programming Model Overview: The OpenMP Programming Model motivation and overview the parallel directive: clauses, equivalent pthread code, examples the for directive and scheduling of loop iterations Pi example in OpenMP

More information

Parallel and High Performance Computing CSE 745

Parallel and High Performance Computing CSE 745 Parallel and High Performance Computing CSE 745 1 Outline Introduction to HPC computing Overview Parallel Computer Memory Architectures Parallel Programming Models Designing Parallel Programs Parallel

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 6. Parallel Processors from Client to Cloud

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 6. Parallel Processors from Client to Cloud COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 6 Parallel Processors from Client to Cloud Introduction Goal: connecting multiple computers to get higher performance

More information

Lecture 13: March 25

Lecture 13: March 25 CISC 879 Software Support for Multicore Architectures Spring 2007 Lecture 13: March 25 Lecturer: John Cavazos Scribe: Ying Yu 13.1. Bryan Youse-Optimization of Sparse Matrix-Vector Multiplication on Emerging

More information

Integrating DMA capabilities into BLIS for on-chip data movement. Devangi Parikh Ilya Polkovnichenko Francisco Igual Peña Murtaza Ali

Integrating DMA capabilities into BLIS for on-chip data movement. Devangi Parikh Ilya Polkovnichenko Francisco Igual Peña Murtaza Ali Integrating DMA capabilities into BLIS for on-chip data movement Devangi Parikh Ilya Polkovnichenko Francisco Igual Peña Murtaza Ali 5 Generations of TI Multicore Processors Keystone architecture Lowers

More information

OmpSs + OpenACC Multi-target Task-Based Programming Model Exploiting OpenACC GPU Kernel

OmpSs + OpenACC Multi-target Task-Based Programming Model Exploiting OpenACC GPU Kernel www.bsc.es OmpSs + OpenACC Multi-target Task-Based Programming Model Exploiting OpenACC GPU Kernel Guray Ozen guray.ozen@bsc.es Exascale in BSC Marenostrum 4 (13.7 Petaflops ) General purpose cluster (3400

More information

OpenACC. Introduction and Evolutions Sebastien Deldon, GPU Compiler engineer

OpenACC. Introduction and Evolutions Sebastien Deldon, GPU Compiler engineer OpenACC Introduction and Evolutions Sebastien Deldon, GPU Compiler engineer 3 WAYS TO ACCELERATE APPLICATIONS Applications Libraries Compiler Directives Programming Languages Easy to use Most Performance

More information

Accelerating GPU kernels for dense linear algebra

Accelerating GPU kernels for dense linear algebra Accelerating GPU kernels for dense linear algebra Rajib Nath, Stanimire Tomov, and Jack Dongarra Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville {rnath1, tomov,

More information

INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing

INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing UNIVERSIDADE TÉCNICA DE LISBOA INSTITUTO SUPERIOR TÉCNICO Departamento de Engenharia Informática Architectures for Embedded Computing MEIC-A, MEIC-T, MERC Lecture Slides Version 3.0 - English Lecture 12

More information

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller Entertainment Graphics: Virtual Realism for the Masses CSE 591: GPU Programming Introduction Computer games need to have: realistic appearance of characters and objects believable and creative shading,

More information