A Comparison of Five Parallel Programming Models for C++


A Comparison of Five Parallel Programming Models for C++

Ensar Ajkunic, Hana Fatkic, Emina Omerovic, Kristina Talic and Novica Nosovic
Faculty of Electrical Engineering, University of Sarajevo, Sarajevo, Bosnia and Herzegovina

Abstract

Multi-core processors offer a growing potential for parallelism but pose the challenge of developing programs that achieve high performance. This paper presents a comparison of five parallel programming models for implementing parallel programs in C++ on multi-core computer systems. The models under consideration are Intel's Threading Building Blocks (TBB), OpenMPI, Intel's Cilk Plus, OpenMP and Pthreads. For demonstration purposes, multiple parallel implementations of a matrix multiplication algorithm, which is well suited to parallelization, were created. The main goal of this paper is a comprehensive comparison of the chosen models with respect to two criteria: performance and the coding effort required.

Index Terms: Parallel programming, Pthreads, OpenMP, TBB, Cilk++, OpenMPI

I. INTRODUCTION

Multiprocessor computer systems are now widely available, with virtually all processors shipped by manufacturers based on multi-core designs [1]. The current trend towards more processor cores instead of higher clock speeds can be attributed to the physical limitations of modern processor designs [2], [3]. This has serious implications for the design and development of applications, as writing and debugging parallel software is difficult and requires more expertise than sequential programming [3]. However, it is vital that programmers become proficient in writing parallel code so that the parallel power available in multi-core computer systems can be harnessed. Various tools and libraries that help programmers make the transition from sequential to parallel programming are available. These include the compiler-directive-based OpenMP [9] and Threading Building Blocks [12], a high-level library providing parallel constructs and optimized parallel algorithms [4], [1], [2], [5]. Lower-level libraries such as Pthreads provide more control but require the programmer to manage threads explicitly [6], [7]. Other parallel and performance tools include Intel Integrated Performance Primitives (IPP), the Intel Math Kernel Library (MKL) and the Microsoft Parallel Patterns Library (PPL). This paper evaluates the performance and code characteristics of the five selected parallel libraries when used to parallelize a matrix multiplication algorithm. The parallel implementations are compared on elapsed program time and speedup. Code characteristics such as the effort required (measured in lines of code added or changed) and the total number of lines are also compared.

II. RELATED WORK

To exploit the parallelism available from multiple CPU cores, software has to be able to spread its workload across multiple processors. On shared-memory multiprocessor systems, such as those based on multi-core CPUs, this is typically achieved using multi-threading, although other techniques such as message passing can be employed. It might seem that if a little threading is good, then a lot must be better. In fact, having too many threads can bog down a program. The impact of having too many threads comes in two ways. First, partitioning a fixed amount of work among too many threads gives each thread so little work that the overhead of starting and terminating threads swamps the useful work.
Second, having too many threads running incurs overhead from the way they share finite hardware resources. A good solution is to limit the number of runnable threads to the number of hardware threads. The algorithms developed in this work used two threads.

A. Shared memory based parallel programming models

A diverse range of shared-memory-based parallel programming models has been developed to date. They can be classified into three main types, as described below [8].

1) Threading models: These models are based on a thread library that provides low-level routines for parallelizing the application. They use mutual exclusion locks and condition variables to establish communication and synchronization between threads. Threading models are suitable for applications based on a multiplicity of data, and they give the programmer a very high degree of flexibility.

2) Directive based models: These models use high-level compiler directives to parallelize applications and can be seen as an extension of the thread-based models. A directive-based model takes care of low-level features like partitioning, worker management, synchronization and communication among the threads. The main advantage of directive models is that it is easy to write parallel applications, and the programmer does not need to consider issues such as data races, false sharing and deadlocks.

3) Tasking models: These models are based on the concept of specifying tasks instead of threads, as done by the other models.

Tasks have a short life span and are more lightweight than threads. One difference between tasks and threads is that tasks are always implemented in user mode.

B. Programming Models Evaluated

This section describes the parallel programming models that are evaluated in this paper. The models evaluated are: Pthreads as a threading model; OpenMP as a directive-based model; TBB and Cilk++ as task-based models; and MPI as both a distributed- and shared-memory model.

1) Pthreads: Portable Operating System Interface (POSIX) threads are an interface consisting of a set of C language procedures and extensions used for creating and managing threads [6]. They extend easily to multiprocessor platforms and are capable of realizing the potential performance gains of parallel programs. Pthreads is a raw threading model that resides on a shared-memory platform and leaves most of the implementation details of a parallel program to the developer, which makes it very flexible. Pthreads has a very low level of abstraction, and hence developing an application with this model is hard from the developer's perspective. With Pthreads the application developer has more responsibilities, such as workload partitioning, worker management, communication and synchronization, and task mapping. The interface defines a wide variety of library routines categorized according to these responsibilities.

2) OpenMP: Open specification for Multi-Processing is an application program interface that defines a set of program directives, run-time library routines and environment variables used to explicitly express multi-threaded, shared-memory parallelism [9]. It is specified for C, C++ and Fortran. OpenMP stands at a high level of abstraction, which eases the development of parallel applications from the developer's perspective. OpenMP hides and handles by itself details such as workload partitioning, worker management, communication and synchronization; the developer only needs to specify directives in order to parallelize the application. OpenMP is not as widely used as Pthreads and has not emerged as a standard to the same degree; the flexibility of this model is also lower than that of Pthreads.

3) TBB: Threading Building Blocks is a parallel programming library developed by Intel Corporation [12]. It offers a highly sophisticated set of parallel primitives for parallelizing applications and enhancing their performance on many cores. TBB is high-level and supports task-based parallelism; it not only replaces threading libraries but also hides the details of the threading mechanisms used for performance and scalability. TBB focuses on scalable data-parallel programming, in which performance grows with the number of processor cores, something that is much harder to achieve with raw threads. It supports scalable parallel programming using plain C++ code and does not require any special languages or compilers. The underlying library is responsible for mapping tasks onto threads in an efficient manner. As a result, TBB lets the programmer express parallelism more conveniently than other models for scalable data-parallel programming. On the other hand, as one goes deeper into the TBB programming model it becomes harder to understand, and in some cases development time increases, because TBB stands at a high level of abstraction.
4) Cilk++: Cilk++ is a task-based parallel library [11]. It enables very fast development of parallel applications using just three Cilk++ keywords and a runtime system, and it is based on C++. It is well suited to problems based on a divide-and-conquer strategy; recursive functions in particular map well onto the Cilk++ language. The Cilk++ keywords identify function calls and loops that can run in parallel, and the Intel Cilk++ runtime schedules the resulting tasks to run efficiently on the available processors. Cilk++ eases development by only requiring the developer to create tasks, and it also provides a facility for detecting races in the program.

5) MPI: The Message Passing Interface (MPI) is a message-passing library standard designed to function on a wide variety of parallel computers [10]. Interface specifications have been defined for C/C++ and Fortran programs. While all the other evaluated models work only on symmetric multiprocessing (SMP) systems, MPI works on both SMP and distributed-memory systems. It is portable: there is no need to modify source code when porting an application to a different platform that supports (and is compliant with) the MPI standard. MPI uses objects called communicators and groups to define which collections of processes may communicate with each other, and most MPI routines require a communicator as an argument. Within a communicator, every process has its own unique integer identifier, called its rank, assigned by the system when the process initializes. Ranks are used by the programmer to specify the source and destination of messages.

Table I compares the five tested parallel programming approaches on several criteria, such as the type of parallelism they support, their complexity, and the compiler and environment support they require.

TABLE I. COMPARISON OF FIVE PARALLEL PROGRAMMING LANGUAGES

III. PROBLEM DEFINITION

The approach focuses on experimentation and benchmarking. The matrix multiplication algorithm was chosen for this simple demonstration of the tools because it provides a best-case scenario for parallelization. Since the workload is easily distributed, any overheads introduced by the parallel tools should become apparent when comparing their performance.

IV. MATRIX MULTIPLICATION PARALLELIZATION

Matrix multiplication is a binary operation that takes a pair of matrices and produces another matrix. If A is an m-by-n matrix and B is an n-by-p matrix, then their matrix product AB is given by

$(AB)_{i,j} = \sum_{r=1}^{n} A_{i,r} B_{r,j} = A_{i,1}B_{1,j} + A_{i,2}B_{2,j} + \dots + A_{i,n}B_{n,j}$, where $1 \le i \le m$ and $1 \le j \le p$.

An examination of the sequential algorithm for matrix multiplication makes it clear that the main part of the work is carried out within three for loops: the first two loops initialize the matrices, and the third, a nested loop, performs the multiplication. As such, the primary focus of the parallel optimization effort is on making the loop iterations run in parallel. However, care has to be taken to ensure that this is safe and that no race conditions are introduced in the parallel implementations of the algorithm. It is evident from the structure of the initialization loop and the outer multiplication loop that there are no data dependencies between their iterations, unlike the inner loops, which have several dependencies. This allows an implementation without the use of locks, so there is no chance of deadlock. The first parallel implementation was carried out using Pthreads; the OpenMP, TBB, Cilk++ and MPI implementations were devised and tested afterwards.
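As a point of reference for the parallel versions discussed below, the sequential kernel just described can be pictured roughly as follows. This is only an illustrative sketch: the Matrix type, the function name and the omission of the initialization loops are assumptions, not the paper's original code.

    #include <vector>

    typedef std::vector<std::vector<double>> Matrix;

    // Illustrative sequential kernel: the triply nested multiplication loop
    // discussed in the text (the two matrix-initialization loops are omitted).
    void multiply_seq(const Matrix& A, const Matrix& B, Matrix& C)
    {
        const std::size_t n = A.size();   // assume square n-by-n matrices
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < n; ++j) {
                double sum = 0.0;
                for (std::size_t k = 0; k < n; ++k)
                    sum += A[i][k] * B[k][j];
                C[i][j] = sum;
            }
    }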
A. Pthreads

The Pthreads implementation is somewhat harder than the OpenMP, TBB and Cilk++ implementations, as it is the programmer's job to manage thread creation and scheduling. When implementing parallelism using Pthreads, four threads were used, and the iteration space was then distributed evenly between the threads (Listing 1).

Listing 1: Matrix multiplication source code using Pthreads
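A minimal sketch of such a Pthreads partitioning, in which each of four worker threads multiplies a contiguous band of rows, might look as follows. The globals, the thread count handling and the function names are illustrative assumptions rather than the paper's original listing.

    #include <pthread.h>
    #include <vector>

    // Illustrative global data: square N-by-N matrices shared by all threads.
    const int N = 1000;
    const int NUM_THREADS = 4;
    std::vector<std::vector<double>> A(N, std::vector<double>(N, 1.0)),
                                     B(N, std::vector<double>(N, 1.0)),
                                     C(N, std::vector<double>(N, 0.0));

    // Each worker multiplies a contiguous band of rows of the result matrix.
    void* multiply_rows(void* arg)
    {
        long id = reinterpret_cast<long>(arg);
        int chunk = N / NUM_THREADS;
        int begin = static_cast<int>(id) * chunk;
        int end   = (id == NUM_THREADS - 1) ? N : begin + chunk;
        for (int i = begin; i < end; ++i)
            for (int j = 0; j < N; ++j) {
                double sum = 0.0;
                for (int k = 0; k < N; ++k)
                    sum += A[i][k] * B[k][j];
                C[i][j] = sum;
            }
        return nullptr;
    }

    void multiply_pthreads()
    {
        pthread_t threads[NUM_THREADS];
        for (long t = 0; t < NUM_THREADS; ++t)
            pthread_create(&threads[t], nullptr, multiply_rows,
                           reinterpret_cast<void*>(t));
        for (int t = 0; t < NUM_THREADS; ++t)
            pthread_join(threads[t], nullptr);
    }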
B. OpenMP

The OpenMP implementation uses simple pragma directives to instruct the compiler. Listing 2 shows how the omp parallel for directive has been used to parallelize the execution of the loop (a sketch of such a loop is given after the TBB description below).

Listing 2: Matrix multiplication source code using OpenMP

C. TBB

The Threading Building Blocks implementation requires more extensive modification of the original code to implement parallelism. Because TBB follows an object-oriented model, the multiplication work was implemented in its own class (Listing 3). A blocked range represents the range over which to iterate, and the call to the multiply function is replaced with a call to the TBB parallel_for template.

Listing 3: Matrix multiplication source code using TBB
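For the OpenMP case referenced above, a minimal sketch of how the directive might be applied to the outer loop follows; the function and variable names are illustrative assumptions, not the paper's listing.

    #include <vector>

    typedef std::vector<std::vector<double>> Matrix;

    // Illustrative OpenMP version: a single directive on the outer loop is
    // enough, since its iterations carry no data dependencies.
    void multiply_openmp(const Matrix& A, const Matrix& B, Matrix& C)
    {
        const int n = static_cast<int>(A.size());
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j) {
                double sum = 0.0;
                for (int k = 0; k < n; ++k)
                    sum += A[i][k] * B[k][j];
                C[i][j] = sum;
            }
    }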
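For the TBB case, a minimal sketch of the pattern just described, with the multiplication work wrapped in a function object and handed to tbb::parallel_for over a blocked_range; the class name Multiply and the Matrix typedef are illustrative assumptions.

    #include <tbb/parallel_for.h>
    #include <tbb/blocked_range.h>
    #include <vector>

    typedef std::vector<std::vector<double>> Matrix;

    // Function object holding references to the matrices; operator() multiplies
    // the band of rows covered by the sub-range TBB hands to it.
    class Multiply {
        const Matrix& A;
        const Matrix& B;
        Matrix& C;
    public:
        Multiply(const Matrix& a, const Matrix& b, Matrix& c) : A(a), B(b), C(c) {}
        void operator()(const tbb::blocked_range<std::size_t>& r) const {
            const std::size_t n = A.size();
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                for (std::size_t j = 0; j < n; ++j) {
                    double sum = 0.0;
                    for (std::size_t k = 0; k < n; ++k)
                        sum += A[i][k] * B[k][j];
                    C[i][j] = sum;
                }
        }
    };

    void multiply_tbb(const Matrix& A, const Matrix& B, Matrix& C)
    {
        tbb::parallel_for(tbb::blocked_range<std::size_t>(0, A.size()),
                          Multiply(A, B, C));
    }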
D. Cilk++

Similar to the TBB implementation, in the Cilk++ implementation the outer loop was parallelized using the cilk_for keyword, and no other modifications were needed (Listing 4; a sketch is given after the MPI description below). The cilk_for loop is a replacement for the normal C/C++ for loop that permits the loop iterations to run in parallel: the compiler converts the loop body into a function that is called recursively using a divide-and-conquer strategy.

Listing 4: Matrix multiplication source code using Cilk++

E. MPI

The MPI implementation of the matrix multiplication algorithm requires significant code modification, since data transfer to the workers and work partitioning have to be done manually. In addition, the many index calculations and the handling of the remaining work when the amount of work is not divisible by the number of workers bring further difficulties for the programmer.

Listing 5: Matrix multiplication source code using MPI
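For the Cilk++ case, a minimal sketch in which swapping the outer for loop for cilk_for is essentially the only change; the include shown follows the Intel Cilk Plus convention and, like the names, is an assumption, since the Cilk Arts toolchain used in the paper may differ.

    #include <cilk/cilk.h>   // Intel Cilk Plus convention; Cilk++ builds may differ
    #include <vector>

    typedef std::vector<std::vector<double>> Matrix;

    // Illustrative Cilk version: cilk_for lets the iterations of the outer loop
    // run in parallel; the runtime schedules them with work stealing.
    void multiply_cilk(const Matrix& A, const Matrix& B, Matrix& C)
    {
        const std::size_t n = A.size();
        cilk_for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < n; ++j) {
                double sum = 0.0;
                for (std::size_t k = 0; k < n; ++k)
                    sum += A[i][k] * B[k][j];
                C[i][j] = sum;
            }
    }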
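For the MPI case, a minimal sketch of one possible row-band decomposition. It assumes flat row-major matrices, a matrix size divisible by the number of ranks, and that MPI_Init has already been called; the paper's own listing additionally handles the non-divisible remainder, which is omitted here.

    #include <mpi.h>
    #include <vector>

    // Illustrative MPI version. Matrices are stored as flat row-major arrays of
    // size N*N, and N is assumed to be divisible by the number of ranks.
    void multiply_mpi(std::vector<double>& A, std::vector<double>& B,
                      std::vector<double>& C, int N)
    {
        int rank = 0, size = 1;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        // Rank 0 is assumed to hold the input matrices; share them with everyone.
        MPI_Bcast(A.data(), N * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        MPI_Bcast(B.data(), N * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        const int rows  = N / size;        // band of rows owned by this rank
        const int first = rank * rows;
        std::vector<double> local(rows * N, 0.0);
        for (int i = 0; i < rows; ++i)
            for (int j = 0; j < N; ++j)
                for (int k = 0; k < N; ++k)
                    local[i * N + j] += A[(first + i) * N + k] * B[k * N + j];

        // Collect the row bands back into C on rank 0.
        MPI_Gather(local.data(), rows * N, MPI_DOUBLE,
                   C.data(), rows * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    }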
V. RESULTS

To compare Pthreads, OpenMP, TBB, Cilk++ and MPI, multiple implementations of the matrix multiplication algorithm were developed. These implementations were then tested, and information about execution time and speedup was collected. These results are elaborated in Section V-A; Section V-B elaborates the code characteristics (measured in lines of code added or changed) for each model. Testing was performed on a dual-core system (Intel Core 2 Duo processor T6500, 2 MB L2 cache, 2.10 GHz, 800 MHz FSB) with 4 GB of DDR2 800 MHz RAM, running Ubuntu. Software versions are as follows: GCC (Cilk Arts build 8503), TBB 4.0, OpenMPI 1.4.4, Boost 1.42 and OpenMP 3.0.

A. Matrix Multiplication Performance Analysis

For benchmarking purposes, 11 data sets were created, increasing the matrix size from 1000 to 2000 in steps of 100. These data sets were then applied to each algorithm. To ensure that the benchmarking process is consistent, each algorithm was rerun 30 times and the average execution time calculated. The first performance comparison of interest is time consumption. Fig. 1 demonstrates that all tested models significantly decrease execution time compared to sequential execution. While it may appear that OpenMP performs poorly compared to Cilk++ and TBB, it must be noted that the matrix multiplication algorithm is an embarrassingly task-parallel problem that is better suited to the task-parallel models Cilk++ and TBB, whereas OpenMP is better suited to data-parallel problems, for which it may prove easier to implement and perform better. Fig. 2 shows the speedup of the various implementations and confirms the previous observation that task-based models outperform the directive- and thread-based ones.

Fig. 1: Performance of Sequential vs. Parallel Matrix Multiplication using OpenMP, TBB, Pthreads, Cilk++ and MPI

Fig. 2: Speedup trends of Parallel Matrix Multiplication using OpenMP, TBB, Pthreads, Cilk++ and MPI over the Sequential Implementation

B. Matrix Multiplication Code Analysis

For code characteristic comparisons, the number of lines of code (LOC) for each implementation, as well as the number of lines modified with respect to the original sequential version, were compared; the results are shown in Table II. OpenMP and Cilk++ require only a single additional line. TBB requires significantly more code modification, but not nearly as much as Pthreads, since the TBB library handles thread management. As expected, the Pthreads implementation required the most code modification, since it is the programmer's responsibility to create, assign and synchronize threads. Significant code modification is also present in the MPI version, because data transfer to the workers and work partitioning have to be done explicitly.

TABLE II. CODE MODIFICATION DATA

VI. CONCLUSION

Choosing the best model when developing a parallel software application depends on multiple factors; the key factors are the development environment and the complexity of the problem. The Pthreads programming model introduces much more complexity within the code than OpenMP, TBB, Cilk++ and even MPI, making it more challenging to develop with. One of the benefits of using TBB, OpenMP or Cilk++, where appropriate, is that creating and managing the threads is handled automatically. Even though OpenMP's trivially simple implementation makes it a reasonable choice for implementing parallelism, this paper showed that it is better to consider other models when the problem is of a task-parallel nature. TBB requires significantly more effort to implement, but it provides more control and is better equipped to handle other problems, such as task-parallel processes, while still delivering a respectable performance improvement when parallel_for is used on data-parallel loops. The MPI implementation requires the most effort and expertise to implement effectively. One advantage of the MPI model over the other four is that MPI can be used on both shared-memory and distributed-memory systems.

REFERENCES

[1] R. Merritt, "Chip industry confronts software gap between multicore, programming," EETimes, Apr. 3, 2008. [Online].
[2] K. Carlson, "SD West: Parallel or Bust," Dr. Dobb's Journal, Mar. 7, 2008. [Online].
[3] B. Hayes, "Computing in a Parallel Universe," American Scientist, vol. 95.
[4] M. Sato, "OpenMP: parallel programming API for shared memory multiprocessors and on-chip multiprocessors," in ISSS '02: Proceedings of the 15th International Symposium on System Synthesis. New York, NY, USA: ACM, 2002.
[5] J. Reinders, Intel Threading Building Blocks: Outfitting C++ for Multi-Core Processor Parallelism. O'Reilly, 2007.
[6] B. Nichols, D. Buttlar, and J. P. Farrell, Pthreads Programming. Sebastopol, CA, USA: O'Reilly & Associates, Inc.
[7] B. Kempf, "The Boost.Threads Library," Dr. Dobb's Journal, May 1, 2002. [Online].
[8] P. Pacheco, An Introduction to Parallel Programming. Morgan Kaufmann, 2011.
[9] R. Chandra, R. Menon, L. Dagum, D. Kohr, D. Maydan, and J. McDonald, Parallel Programming in OpenMP. Morgan Kaufmann, 2000.
[10] M. S. Müller, M. M. Resch, A. Schulz, and W. E. Nagel (Eds.), Tools for High Performance Computing: Proceedings of the 3rd International Workshop on Parallel Tools for High Performance Computing, September 2009, ZIH, Dresden.
[11] F. Gebali, Algorithms and Parallel Computing. John Wiley & Sons, 2011.
[12] J. Reinders, Intel Threading Building Blocks: Outfitting C++ for Multi-Core Processor Parallelism. O'Reilly Media, 2007.
