A Taxonomy of Task-Based Technologies for High-Performance Computing


Peter Thoman 1, Khalid Hasanov 2, Kiril Dichev 3, Roman Iakymchuk 4, Xavier Aguilar 4, Philipp Gschwandtner 1, Erwin Laure 4, Herbert Jordan 1, Pierre Lemarinier 2, Kostas Katrinis 2, Dimitrios S. Nikolopoulos 3, Thomas Fahringer 1

1 University of Innsbruck, 6020 Innsbruck, Austria {petert,philipp,herbert,tf}@dps.uibk.ac.at
2 IBM Ireland, Dublin 15, Ireland {khasanov,pierrele,katrinisk}@ie.ibm.com
3 Queen's University of Belfast, Belfast BT7 1NN, United Kingdom {K.Dichev,D.Nikolopoulos}@qub.ac.uk
4 KTH Royal Institute of Technology, Stockholm, Sweden {riakymch,xaguilar,erwinl}@kth.se

Abstract. Task-based programming models for shared memory, for example OpenMP and Cilk, have existed for decades and are well documented. However, with the increase in heterogeneous, many-core, and parallel systems, a number of research-driven projects have developed more diversified task-based support, employing various programming and runtime features. Unfortunately, despite the fact that dozens of different task-based systems exist today and are actively used for parallel and high-performance computing, no comprehensive overview or classification of task-based technologies for HPC exists. In this paper, we provide an initial task-focused taxonomy for HPC technologies, which covers both programming interfaces and runtime mechanisms. We demonstrate the usefulness of our taxonomy by classifying state-of-the-art task-based environments in use today.

Keywords: Task-based parallelism, taxonomy, API, runtime system, scheduler, monitoring framework, fault tolerance.

1 Introduction

A large number of task-based programming environments have been developed over the past decades, and even well-established programming languages like C++ have now integrated tasks, allowing their use for shared-memory parallelism. For the purpose of this work, we define a task as follows:

A task is a sequence of instructions within a program that can be processed concurrently with other tasks in the same program. The interleaved execution of tasks may be constrained by control- and data-flow dependencies.

The Cilk language [19] allows task-focused parallel programming and is an early example of efficient task scheduling via work stealing. Language extensions like OpenMP [4] (since version 3.0) integrate tasks into their programming interface. Industry-standard and well-supported parallel libraries based on task parallelism have emerged, such as Intel TBB [24].
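As a minimal illustration of this definition, the following sketch expresses a recursive computation with OpenMP tasks (available since version 3.0): each recursive call is spawned as a task that may execute concurrently with its sibling, and the explicit taskwait expresses the control- and data-flow dependency on the two child tasks.

```cpp
#include <cstdio>

// Minimal sketch: each recursive call becomes a task that may run
// concurrently with its sibling; taskwait enforces the dependency
// before the partial results are combined.
// Compile with OpenMP enabled, e.g. g++ -fopenmp.
long fib(int n) {
    if (n < 2) return n;
    long a, b;
    #pragma omp task shared(a)
    a = fib(n - 1);
    #pragma omp task shared(b)
    b = fib(n - 2);
    #pragma omp taskwait   // synchronize with the two child tasks
    return a + b;
}

int main() {
    long result;
    #pragma omp parallel
    #pragma omp single      // one thread spawns the root of the task tree
    result = fib(30);
    std::printf("fib(30) = %ld\n", result);
}
```

Here the task graph forms a tree, synchronization is explicit (taskwait), and results are handled implicitly by writing back to the shared variables a and b; Section 2 introduces these characteristics systematically.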

Task-based environments for heterogeneous hardware have also naturally developed with the emergence of accelerator and GPU computing; StarPU [8] is an example of such an environment. In addition, task-based parallelism is increasingly employed on distributed memory systems, which constitute the most important target for high-performance computing (HPC). In this context, tasks are often combined with a global address space (GAS) programming model and scheduled across multiple processes, which together form the distributed execution of a single task-parallel program. While some examples of global address space environments with task-based parallelism are specifically designed languages such as Chapel [6] and X10 [18], it is also possible to implement these concepts as a library. For instance, HPX [10] is an asynchronous GAS runtime, and Charm++ [23] uses a global object space. This already very diverse landscape is made even more complex by the recent appearance of task-based runtimes using novel concepts, such as the data-centric programming language Legion [1].

Many of these task-based programming environments are maintained by a dedicated community of developers and are often research-oriented. As such, there might be relatively little accessible documentation of their features and inner workings. Crucially, at this point, there is no up-to-date and comprehensive taxonomy and classification of existing common task-based environments. This makes it very difficult for researchers or developers with an interest in task-based HPC software development to get a concise picture of the alternatives to the omnipresent MPI programming model. In this work, we attempt to address this issue by providing a taxonomy and classification of both state-of-the-art task-based programming environments and more established alternatives.

We consider a task-based environment as consisting of two major components: a programming interface (API) and a runtime system; the former is the interface that a given environment provides to the programmer, while the latter encompasses the underlying implementation mechanisms. We present a set of API characteristics allowing meaningful classification in Section 2. For discussing the more involved topic of runtime mechanisms, we further structure our analysis into the overarching topics of scheduling, performance monitoring, and fault handling (see Section 3). Finally, based on the taxonomy introduced, we classify and categorize existing APIs and runtimes in Section 4.

2 Task-Parallel Programming Interfaces (APIs)

The Application Programming Interface (API) of a given task-parallel programming environment defines the way an application developer describes parallelism, dependencies, and, in many cases, other more specific information such as the work mapping structure or data distribution options. As such, finding a way to concisely characterize APIs from a developer's perspective is crucial in providing an overview of task-parallel technologies. In this work, we define a set of characterizing features for such APIs which encompasses all relevant aspects while remaining as compact as possible. A subset of these features was adapted from previous work by Kasim et al. [11]. To these existing characteristics we added additional information of general importance, such as technology readiness levels, as well as features which relate to new capabilities particularly relevant for modern HPC, like support for heterogeneity and resilience management.

We will now define each of these characteristics and their available options for categorization. Note that explicit (e) support generally refers to features which are supported but require extra effort or implementation by the application developer, while implicit (i) support means that the toolchain manages the feature automatically, given a default representation of the program in the API.

Technology Readiness: The technology readiness level of the given API and its implementations, according to the European Commission definition. 5

Distributed Memory: Whether targeting distributed memory systems is supported. Options are no support, explicit support, or implicit support. Explicit refers to, for example, message passing between address spaces, while automatic data migration would be an example of implicit support.

Heterogeneity: Indicates whether tasks can be executed on accelerators (e.g. GPUs). Again, explicit, implicit, and no support are possible, where the former means that the application developer has to actively provision tasks to run on accelerators, using a distinct API.

Worker Management: Whether the worker threads and/or processes need to be started and maintained by the user (explicit) or are provided automatically by the environment (implicit).

Task Partitioning: Indicates whether each task is atomic, and can thus only be scheduled as a single unit, or can be subdivided/split.

Work Mapping: Describes the way tasks are mapped to the existing hardware resources. Possibilities include explicit work mapping, implicit work mapping (e.g. work stealing), or pattern-based work mapping.

Synchronization: Whether tasks are synchronized in an implicit fashion, e.g. by regions or the function scope, or explicitly by the application developer.

Resilience Management: Describes whether the API has support for task resilience management, e.g. fine-grained checkpointing and restart.

Communication Model: Either shared memory (smem), message passing (msg), or Partitioned Global Address Space (pgas).

Result Handling: How the tasking model supports handling the results of task computations: implicitly, via write-back to existing data, or via explicitly provided task result types (which might, for example, be accessed as futures); see the sketch after this list.

Graph Structure: The type of task graph dependency structure supported by the given API: a tree structure, an acyclic graph (dag), or an arbitrary graph.

Task Cancellation: Whether the tasking model supports cancellation of tasks: no cancellation support, or cancellation supported either cooperatively (only at task scheduling points) or preemptively.

Implementation Type: How the API is implemented and addressed from a program. A tasking API can be provided as a library, as a language extension (e.g. pragmas), or as an entire language with task integration.

5 annexes/h2020-wp1415-annex-g-trl_en.pdf. Accessed:
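To make the explicit/implicit distinction concrete for result handling and synchronization, the following minimal C++ sketch (using only the standard library, independent of the APIs classified later) contrasts an explicitly handled result, delivered through a future, with an implicit write-back to existing data.

```cpp
#include <future>
#include <vector>
#include <numeric>
#include <cstdio>

int main() {
    // Explicit result handling: the task's result is delivered through a
    // dedicated future object that the caller must query.
    std::future<int> sum = std::async(std::launch::async, [] {
        std::vector<int> v(1000, 1);
        return std::accumulate(v.begin(), v.end(), 0);
    });

    // Implicit result handling: the task writes back into existing data;
    // synchronization (here: get() on a void future) is what makes the
    // result visible to the parent.
    int count = 0;
    auto done = std::async(std::launch::async, [&count] { count = 42; });

    std::printf("explicit: %d\n", sum.get());
    done.get();                      // explicit synchronization point
    std::printf("implicit: %d\n", count);
}
```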

3 Many-Task Runtime Systems

Many-task runtime systems serve as the basis for implementing these APIs and are considered a promising tool in addressing key issues associated with Exascale computing. In this section we provide a taxonomy of many-task runtime systems, which is summarized and illustrated in Figure 1.

Fig. 1: Taxonomy of Many-Task Runtime Systems, covering scheduling (Section 3.1), performance monitoring (Section 3.2), fault tolerance (Section 3.3), data distribution across nodes, and target architecture.

A crucial difference among various many-task runtime systems is the target architecture they support. The evolution of many-task runtime systems started from homogeneous shared-memory computers with multiple cores and continued with runtimes for heterogeneous shared-memory and/or distributed-memory systems. The way different runtimes support distributed-memory systems is not uniform in terms of the distribution of computations across the nodes. In the case of implicit data distribution, data distribution is handled by the runtime, without putting any burden on the application developer. In explicit data distribution, on the other hand, the distribution across the nodes is explicitly specified by the programmer.

The increase in the number and type of compute units in HPC systems naturally requires efficiency not only in the total execution time of applications, but also in power and/or energy. Thus, whether the runtime provides additional scheduling objectives other than the total execution time is another important distinction. At the same time, there is no single standard scheduling methodology used by all many-task runtime systems. Some of them provide automatic scheduling within a single shared-memory machine while the application developer needs to handle distributed-memory execution explicitly, while others provide uniform scheduling policies across different nodes.

Many-task runtimes may require performance introspection and monitoring to facilitate the implementation of different scheduling policies. While such monitoring traditionally was not part of runtimes, requirements for on-the-fly performance information have surfaced. Thus, most task-based runtimes already provide and make use of introspection capabilities.

Fault tolerance is another key factor that is important in many-task runtime systems in the context of Exascale requirements. As detailed in Section 3.3, a runtime may have no resilience capabilities, or it may target task faults or even process faults.

3.1 Scheduling in Many-Task Runtime Systems

Task Scheduling Targets: Depending on the capabilities of the underlying many-task runtime system, its scheduling domain is usually limited to a single shared-memory homogeneous compute node, a heterogeneous compute node with accelerators, homogeneous distributed-memory systems of interconnected compute nodes, or, in the most generic form, heterogeneous distributed-memory systems. By supporting different types of heterogeneous architectures, the runtime can facilitate source code portability and support transparent interaction between different types of computation units for application developers.

Traditionally, the execution time has been the main objective to minimize for different scheduling policies. However, the increasing scale of HPC systems makes it necessary to take the energy and power budgeting of the target system into account as well. Therefore, some many-task runtime systems have already started providing energy-aware [14] scheduling policies. In addition, recent research projects such as AllScale 6 focus on multi-objective scheduling policies, trying to find optimal trade-offs among conflicting optimization objectives like execution time, energy consumption, and/or resource utilization; a minimal sketch of such a trade-off decision is given below.

6 The AllScale EC-funded FET-HPC project: allscale.eu.
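As a minimal, hypothetical illustration of such a multi-objective decision (not modeled on any particular runtime), the following sketch keeps only the Pareto-optimal candidate configurations with respect to predicted execution time and energy consumption; a concrete policy would then pick from this front, for example the fastest candidate within a given power budget.

```cpp
#include <vector>
#include <algorithm>

// Each candidate configuration (e.g. number of cores, clock frequency)
// has a predicted execution time and energy consumption; the predictions
// are assumed to come from a performance model.
struct Candidate {
    int    cores;
    double time_s;      // predicted execution time
    double energy_j;    // predicted energy consumption
};

// a dominates b if it is no worse in both objectives and better in at least one.
bool dominates(const Candidate& a, const Candidate& b) {
    return a.time_s <= b.time_s && a.energy_j <= b.energy_j &&
           (a.time_s < b.time_s || a.energy_j < b.energy_j);
}

// Keep only candidates that are not dominated by any other candidate.
std::vector<Candidate> pareto_front(const std::vector<Candidate>& cs) {
    std::vector<Candidate> front;
    for (const auto& c : cs) {
        bool dominated = std::any_of(cs.begin(), cs.end(),
            [&](const Candidate& other) { return dominates(other, c); });
        if (!dominated) front.push_back(c);
    }
    return front;
}
```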

Task Scheduling Methods: Extensive research has been conducted in task scheduling methodologies. We do not try to list all the different techniques for task scheduling, but rather highlight methods used in state-of-the-art many-task runtime systems. The task scheduling problem can be addressed either statically or dynamically. In the former case, depending on the decision function, it is assumed that one or more of the following inputs are known in advance: the execution times of each task, inter-dependencies between tasks, task precedence, resource usage of each task, the location of the input data, task communications, and synchronization points. This is by no means an exhaustive list, but it gives an indication of the multiple possible a priori inputs for static scheduling. Using all this information, the scheduling can be performed offline, during compilation; a minimal sketch of such an offline assignment is given below.
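To make the static case concrete, the following is a minimal, hypothetical sketch of an offline list-scheduling assignment based solely on a priori execution-time estimates; dependencies, communication, and data locality, which real list schedulers also take into account, are deliberately left out.

```cpp
#include <vector>
#include <utility>
#include <algorithm>

// Task with an a priori estimated execution time.
struct TaskInfo {
    int    id;
    double est_time;
};

// Static list scheduling sketch: order tasks into a fixed priority list
// (longest estimated task first) and greedily assign each task to the
// processor with the smallest accumulated load. The list is built before
// any task executes and is never re-sequenced.
std::vector<std::pair<int, int>> list_schedule(std::vector<TaskInfo> tasks, int num_procs) {
    std::sort(tasks.begin(), tasks.end(),
              [](const TaskInfo& a, const TaskInfo& b) { return a.est_time > b.est_time; });

    std::vector<double> load(num_procs, 0.0);
    std::vector<std::pair<int, int>> assignment;   // (task id, processor)
    for (const TaskInfo& t : tasks) {
        int p = static_cast<int>(std::min_element(load.begin(), load.end()) - load.begin());
        load[p] += t.est_time;
        assignment.emplace_back(t.id, p);
    }
    return assignment;
}
```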

On the other hand, dynamic scheduling is mainly used in cases where there is not enough information in advance or obtaining such information is not trivial. Additionally, hybrid policies which integrate static and dynamic information are possible.

Most static scheduling algorithms used in many-task runtime systems are based on list scheduling methods. Here, it is assumed that the scheduling list of tasks is statically built before any task starts executing, and the sequence of the tasks in the list is not modified. The list scheduling approach can easily be adapted and used for dynamic scheduling by re-computing and re-sequencing the list of tasks. As a matter of fact, heuristic policies based on list scheduling and performance models are employed in some many-task runtime systems [8].

Work-stealing [2] can be considered the most widely used dynamic scheduling method in task-based runtime systems. The main idea in work-stealing is to distribute tasks between per-processor work queues, where each processor operates on its local queue. Processors can steal tasks from other queues to perform load balancing. There are two main approaches to implementing work-stealing, namely child-stealing and parent-stealing. In parent-stealing, which is also called the work-first policy, a worker executes a spawned task and leaves the continuation to be stolen by another worker. Child-stealing, which is also called the help-first policy, does the opposite: the worker executes the continuation and leaves the spawned task to be stolen by other workers. A sketch of the child-stealing approach is given at the end of this section.

Another approach to dynamic scheduling for many-task runtime systems is the work-sharing strategy. Unlike work-stealing, it schedules each task onto a processor when it is spawned, and it is usually implemented using a centralized task pool. In work-sharing, whenever a worker spawns a new task, the scheduler migrates it to a new worker to improve load balancing. As such, the migration of tasks happens more often in work-sharing than in work-stealing.

Few of the existing many-task runtime systems provide energy-efficient scheduling policies. In the primitive case, it is assumed that the application can provide an energy consumption model which can be used by a scheduling policy as part of its objective function. In more advanced cases, the runtime provides offline or online profiling data, such as instructions per cycle (IPC) and last-level cache misses (LLCM). This data is used to build a look-up table that maps each frequency setting to the triple of IPC, LLCM, and the number of active cores. A scheduling decision is then made based on this information [14].
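Before turning to performance monitoring, the following sketch illustrates the child-stealing variant described above with per-worker double-ended queues. It is a deliberately simplified illustration (a mutex-protected deque instead of the lock-free structures used by production runtimes) and is not modeled on any specific system.

```cpp
#include <deque>
#include <mutex>
#include <optional>
#include <functional>
#include <vector>
#include <cstddef>

// Child-stealing with per-worker deques: a worker pushes spawned tasks to
// the bottom of its own deque and pops from the bottom (LIFO), while idle
// workers steal from the top (FIFO) of a victim's deque.
using Task = std::function<void()>;

struct WorkerQueue {
    std::deque<Task> tasks;
    std::mutex m;

    void push(Task t) {                       // owner: spawn a child task
        std::lock_guard<std::mutex> lk(m);
        tasks.push_back(std::move(t));
    }
    std::optional<Task> pop() {               // owner: continue with the newest task
        std::lock_guard<std::mutex> lk(m);
        if (tasks.empty()) return std::nullopt;
        Task t = std::move(tasks.back());
        tasks.pop_back();
        return t;
    }
    std::optional<Task> steal() {             // thief: take the oldest task
        std::lock_guard<std::mutex> lk(m);
        if (tasks.empty()) return std::nullopt;
        Task t = std::move(tasks.front());
        tasks.pop_front();
        return t;
    }
};

// A worker first drains its local queue, then tries to steal from others.
void worker_loop(std::size_t self, std::vector<WorkerQueue>& queues) {
    for (;;) {
        if (auto t = queues[self].pop()) { (*t)(); continue; }
        bool stole = false;
        for (std::size_t v = 0; v < queues.size(); ++v) {
            if (v == self) continue;
            if (auto t = queues[v].steal()) { (*t)(); stole = true; break; }
        }
        if (!stole) break;                    // no work anywhere: terminate (sketch only)
    }
}
```

In the parent-stealing (work-first) variant, the deque would hold continuations instead: the spawning worker immediately executes the new task and leaves its continuation behind to be stolen.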

3.2 Performance Monitoring

The high concurrency and dynamic behavior of upcoming Exascale systems pose a demand for performance observation and runtime introspection. This performance information is very valuable to guide HPC runtimes in their execution and resource adaptation, thereby maximizing application performance and resource utilization. When targeting performance observation, performance monitoring software generates data to be used either online [7,1,13,15,16] or offline [15,8,5,1]; in other words, the distinction is whether the collected data is used while the application still runs or after its execution. Furthermore, this taxonomy can be extended with respect to who consumes the data: either the end user (performance analysis) or the runtime itself (introspection and historical data). Real-time performance data (introspection and performance models from historical data) will play an important role at Exascale for runtime adaptation and optimal task scheduling.

3.3 Task, Process, and System Faults

For this topic, we extend a recent taxonomy [22] from the HPC domain to include the concept of task faults. We retain detectability of faults as the main criterion, but distinguish three levels of the system: distributed execution, process, and task (see Figure 2). Each of these levels may experience a fault, and each of them has a different scope.

Fig. 2: A taxonomy of faults based on detection capabilities: task faults (detected by the process), process faults (detected by the distributed execution), and system faults (undetected).

Task Faults: Tasks have the smallest scope of the three; still, a failure of a task may affect the result of a process, and subsequently of a distributed run. A typical example is undetected errors in memory. The process which runs a task is generally capable of detecting task faults. There are several examples of shared-memory runtimes where task faults within parallel regions have been detected and corrected [20,17]. A minimal sketch of recovery at this level is given below.

Process Faults: A process may also fail, which leads to the termination of all underlying tasks. For example, a node crash can lead to a process failure. In such a scenario, a process cannot detect its own failure; however, in a distributed run, another process may detect the failure and trigger a recovery strategy across all processes. A recovery strategy in this case may rely on one of two redundancy techniques: checkpoint/restart or replication.

System Faults: On the last level, a distributed system execution may fail in cases of severe faults like a switch failure or a power outage. In this case, the failure cannot be detected, and no recovery strategy can be applied.
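To illustrate the task level of this taxonomy, the following hypothetical sketch handles a detected task fault by bounded re-execution, assuming the task is free of side effects so that a failed attempt can simply be retried; process and system faults cannot be handled this way, since the entity that would detect and recover from the fault is itself lost.

```cpp
#include <functional>
#include <stdexcept>

// Minimal sketch of task-level fault handling by bounded re-execution.
// Assumption: the task is side-effect-free (idempotent), so a detected
// task fault (surfacing here as an exception) can be handled by simply
// re-running the task.
template <typename Result>
Result run_resilient(const std::function<Result()>& task, int max_retries = 3) {
    for (int attempt = 0; attempt <= max_retries; ++attempt) {
        try {
            return task();                        // detected task fault -> exception
        } catch (const std::exception&) {
            if (attempt == max_retries) throw;    // escalate: unrecoverable at task level
        }
    }
    throw std::runtime_error("unreachable");
}
```

A real runtime would additionally have to guarantee that partial results of a failed attempt never become visible, for example by buffering task output until the task completes successfully.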

4 Classification

Table 1 classifies the existing task-parallel APIs according to the API taxonomy (see Section 2), with some additional clarifications. First, for an API to support a given feature, this API must not require the user to resort to third-party libraries or implementation-specific details of the API. For instance, some APIs offer arbitrary task graphs via manual task reference counting [21]; this does not qualify as support in our classification. Second, all APIs shown as featuring task cancellation do so in a non-preemptive manner, due to the absence of OS-level preemption capabilities.

Some entries require additional clarification. In the C++ STL, we consider the entity launched by std::async to represent a task. Also, while StarPU offers shared memory parallelism, it is capable of generating MPI communication from a given task graph and data distribution [8]; hence it is marked with explicit support for distributed memory using a message-based communication model. Furthermore, PaRSEC includes both a task-based runtime that works on user-specified task graph and data distribution information, as well as a compiler that accepts serial input and generates this data. As the latter is limited to loops, we only consider the runtime in this work.

Table 1: Feature Comparison of APIs for Task Parallelism. Entries are listed per API in column order: Technology Readiness, Distributed Memory, Heterogeneity, Worker Management, Task Partitioning, Work Mapping, Synchronization, Resilience Management, Communication Model, Result Handling, Graph Structure, Task Cancellation, Implementation Type (Library / Extension / Lang.); features without support in a given API are left out of its row.

C++ STL: 9, i, i/e, e, smem, i/e, dag
TBB: 8, i, i, i, smem, i, tree
HPX: 6, i, e, i, i/e, e, pgas, e, dag
Legion: 4, i, e, i, i/e, e, pgas, e, tree
PaRSEC: 4, e, e, i, i/e, i, msg, e, dag
OpenMP: 9, i, e, i, i/e, smem, i, dag
Charm++: 6, i, e, i, i/e, e, pgas, i/e, dag
OmpSs: 5, i, i, i, i/e, smem, i, dag
StarPU: 5, e, e, i, i/e, e, msg, i, dag
Cilk: 8, i, i, e, smem, i, tree
Chapel: 5, i, i, i, i/e, e, pgas, i, dag
X10: 5, i, i, i, i/e, e, pgas, i, dag

Several observations can be made from the data presented in Table 1. First, all APIs with distributed memory support also allow task partitioning and support heterogeneity in some form. APIs offering implicit distributed memory support employ a global address space. Second, among APIs lacking distributed memory support, only OmpSs offers resilience (via its Nanos++ runtime), and distributed memory APIs only recently started to include resilience support [3], likely driven by the continuous increase in machine sizes and hence decreased mean time between failures. Finally, some form of heterogeneity support is provided in almost all modern APIs, though it often requires explicit heterogeneous task provisioning by the programmer.

Table 2 provides the corresponding classification with respect to the runtime system and its subcomponents (see Section 3). It is worth mentioning that there are various contributions extending runtime features, but these contributions are not yet part of the respective main releases. We do not consider such extended features in our taxonomy. For instance, recent work in X10 [12] extends the X10 scheduler with distributed work-stealing algorithms across nodes; however, we classify X10 as not (yet) having a distributed scheduler. The same applies to StarPU and OmpSs: new distributed-memory scheduling policies are being developed for both runtimes, but they are not part of their main releases yet 7. Also, for Chapel, X10, and HPX, there is automatic data distribution support (a runtime feature); however, these runtimes require explicit work mapping in distributed memory environments (an API feature).

7 We received feedback from their developers.

Table 2: Feature Comparison of Runtimes for Task Parallelism. Legend: i = implicit, e = explicit, m = multiple (incl. ws), l = limited, ws = work stealing, tf = task faults, pf = process faults, sm = shared memory, d = distributed memory, on = online use, off = offline (post-mortem) use.

Runtime | Data Distribution | Scheduling on shared memory | Scheduling on distr. memory | Performance Monitoring | Fault Tolerance | Target Architecture
OpenMP | - | m | - | off/on | - | sm
TBB | - | ws | - | off | - | sm
Cilk-Plus | - | ws | - | off | - | sm
StarPU | e | m | - | off/on | - | d
OmpSs | i | m | - | off/on | tf | d
Charm++ | i | m | m | off/on | pf | d
X10 | i | ws | - | off | pf | d
Chapel | i | ws | - | off | pf | d
HPX | i | ws | - | off/on | - | d
PaRSEC | e | m | l | off | tf | d
Legion | i | ws | ws | off/on | pf | d

Most of the runtime systems show similarities in scheduling within a single shared-memory node, and work-stealing is the most common scheduling method. On the other hand, there is no established method for inter-node scheduling. For instance, PaRSEC [9] only provides limited inter-node scheduling based on remote completion notifications, while Legion uses distributed work-stealing.

5 Conclusions

The shift in HPC towards emerging task-based parallel programming paradigms has led to a broad ecosystem of different task-based technologies. With such diversity, and some degree of isolation between individual communities of developers, there is a lack of documentation and common classification, hindering researchers from gaining a complete view of the field. In this paper, we provide an initial attempt to establish a common taxonomy and provide the corresponding categorization for many existing task-based programming environments. We divided our taxonomy into two broad categories: API characteristics, which define how the programmer interacts with the system; and many-task runtime systems, classifying the underlying technologies. For the latter, we analyze the types of scheduling policies and goals supported, online and offline performance monitoring integration, as well as the level of resilience and detection provided for task, process, and system faults.

Acknowledgement: This work was partially supported by the AllScale project, which has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No. 671603.

References

1. M. E. Bauer. Legion: Programming distributed heterogeneous architectures with logical regions. PhD thesis, Stanford University.
2. R. D. Blumofe and C. E. Leiserson. Scheduling multithreaded computations by work stealing. Journal of the ACM (JACM), 46(5).
3. D. Cunningham. Resilient X10: Efficient failure-aware programming. In Proceedings of PPoPP14. ACM.
4. L. Dagum and R. Menon. OpenMP: An industry standard API for shared-memory programming. IEEE Computational Science and Engineering, 5(1):46-55.
5. A. Duran et al. OmpSs: A proposal for programming heterogeneous multi-core architectures. Parallel Processing Letters, 21(02).
6. B. L. Chamberlain et al. Parallel programmability and the Chapel language. International Journal of High Performance Computing Applications, 21(3).
7. C. Augonnet et al. Automatic calibration of performance models on heterogeneous multicore architectures. In Euro-Par 2009. Springer.
8. C. Augonnet et al. StarPU: A unified platform for task scheduling on heterogeneous multicore architectures. Concurrency and Computation: Practice and Experience, 23(2).
9. G. Bosilca et al. PaRSEC: Exploiting heterogeneity to enhance scalability. Computing in Science & Engineering, 15(6):36-45.
10. H. Kaiser et al. HPX: A task based programming model in a global address space. In PGAS 14, page 6. ACM.
11. H. Kasim et al. Survey on parallel programming model. In Proceedings of NPC 08. Springer.
12. J. Paudel et al. On the merits of distributed work-stealing on selective locality-aware tasks. In International Conference on Parallel Processing.
13. J. Planas et al. Self-adaptive OmpSs tasks in heterogeneous environments. In IPDPS13. IEEE.
14. J. C. Meyer et al. Implementation of an Energy-Aware OmpSs Task Scheduling Policy. Accessed:
15. K. Huck et al. An early prototype of an autonomic performance environment for exascale. In Proceedings of ROSS13, page 8. ACM.
16. K. Huck et al. An autonomic performance environment for exascale. Supercomputing Frontiers and Innovations, 2(3):49-66.
17. O. Subasi et al. NanoCheckpoints: A task-based asynchronous dataflow framework for efficient and scalable checkpoint/restart. In PDP 15.
18. Ph. Charles et al. X10: An object-oriented approach to non-uniform cluster computing. In Proceedings of OOPSLA05. ACM.
19. R. D. Blumofe et al. Cilk: An efficient multithreaded runtime system. Journal of Parallel and Distributed Computing, 37(1):55-69.
20. S. Hukerikar et al. Opportunistic application-level fault detection through adaptive redundant multithreading. In Proceedings of HPCS14.
21. General Acyclic Graphs of Tasks in TBB. node/ Accessed:
22. M. Hoemmen and M. A. Heroux. Fault-tolerant iterative methods via selective reliability. In Proceedings of SC11, page 9. IEEE Computer Society.
23. L. V. Kale and S. Krishnan. Charm++: A portable concurrent object oriented system based on C++. In OOPSLA93. ACM.
24. Th. Willhalm and N. Popovici. Putting Intel Threading Building Blocks to work. In IWMSE08, pages 3-4. ACM, 2008.


More information

Workload Characterization using the TAU Performance System

Workload Characterization using the TAU Performance System Workload Characterization using the TAU Performance System Sameer Shende, Allen D. Malony, and Alan Morris Performance Research Laboratory, Department of Computer and Information Science University of

More information

Scalable In-memory Checkpoint with Automatic Restart on Failures

Scalable In-memory Checkpoint with Automatic Restart on Failures Scalable In-memory Checkpoint with Automatic Restart on Failures Xiang Ni, Esteban Meneses, Laxmikant V. Kalé Parallel Programming Laboratory University of Illinois at Urbana-Champaign November, 2012 8th

More information

Non-uniform memory access machine or (NUMA) is a system where the memory access time to any region of memory is not the same for all processors.

Non-uniform memory access machine or (NUMA) is a system where the memory access time to any region of memory is not the same for all processors. CS 320 Ch. 17 Parallel Processing Multiple Processor Organization The author makes the statement: "Processors execute programs by executing machine instructions in a sequence one at a time." He also says

More information

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism Motivation for Parallelism Motivation for Parallelism The speed of an application is determined by more than just processor speed. speed Disk speed Network speed... Multiprocessors typically improve the

More information

Scalable GPU Graph Traversal!

Scalable GPU Graph Traversal! Scalable GPU Graph Traversal Duane Merrill, Michael Garland, and Andrew Grimshaw PPoPP '12 Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming Benwen Zhang

More information

On Object Orientation as a Paradigm for General Purpose. Distributed Operating Systems

On Object Orientation as a Paradigm for General Purpose. Distributed Operating Systems On Object Orientation as a Paradigm for General Purpose Distributed Operating Systems Vinny Cahill, Sean Baker, Brendan Tangney, Chris Horn and Neville Harris Distributed Systems Group, Dept. of Computer

More information

Resilient Distributed Concurrent Collections. Cédric Bassem Promotor: Prof. Dr. Wolfgang De Meuter Advisor: Dr. Yves Vandriessche

Resilient Distributed Concurrent Collections. Cédric Bassem Promotor: Prof. Dr. Wolfgang De Meuter Advisor: Dr. Yves Vandriessche Resilient Distributed Concurrent Collections Cédric Bassem Promotor: Prof. Dr. Wolfgang De Meuter Advisor: Dr. Yves Vandriessche 1 Evolution of Performance in High Performance Computing Exascale = 10 18

More information

Load Balancing in the Macro Pipeline Multiprocessor System using Processing Elements Stealing Technique. Olakanmi O. Oladayo

Load Balancing in the Macro Pipeline Multiprocessor System using Processing Elements Stealing Technique. Olakanmi O. Oladayo Load Balancing in the Macro Pipeline Multiprocessor System using Processing Elements Stealing Technique Olakanmi O. Oladayo Electrical & Electronic Engineering University of Ibadan, Ibadan Nigeria. Olarad4u@yahoo.com

More information

Practical Considerations for Multi- Level Schedulers. Benjamin

Practical Considerations for Multi- Level Schedulers. Benjamin Practical Considerations for Multi- Level Schedulers Benjamin Hindman @benh agenda 1 multi- level scheduling (scheduler activations) 2 intra- process multi- level scheduling (Lithe) 3 distributed multi-

More information

DuctTeip: A TASK-BASED PARALLEL PROGRAMMING FRAMEWORK FOR DISTRIBUTED MEMORY ARCHITECTURES

DuctTeip: A TASK-BASED PARALLEL PROGRAMMING FRAMEWORK FOR DISTRIBUTED MEMORY ARCHITECTURES DuctTeip: A TASK-BASED PARALLEL PROGRAMMING FRAMEWORK FOR DISTRIBUTED MEMORY ARCHITECTURES AFSHIN ZAFARI, ELISABETH LARSSON, AND MARTIN TILLENIUS Abstract. Current high-performance computer systems used

More information

Part IV. Chapter 15 - Introduction to MIMD Architectures

Part IV. Chapter 15 - Introduction to MIMD Architectures D. Sima, T. J. Fountain, P. Kacsuk dvanced Computer rchitectures Part IV. Chapter 15 - Introduction to MIMD rchitectures Thread and process-level parallel architectures are typically realised by MIMD (Multiple

More information

AutoTune Workshop. Michael Gerndt Technische Universität München

AutoTune Workshop. Michael Gerndt Technische Universität München AutoTune Workshop Michael Gerndt Technische Universität München AutoTune Project Automatic Online Tuning of HPC Applications High PERFORMANCE Computing HPC application developers Compute centers: Energy

More information