
Technologies for multi-core numeric computation

To compare the ConcRT, OpenMP and TBB technologies, we implemented several algorithms from different areas of numeric computation and compared their performance with serial C++ equivalents. Four functions were implemented. The first two come from linear algebra: function 1 computes the determinant of a matrix with non-zero leading principal minors, using an algorithm based on the LU decomposition of the input matrix; function 2 computes the conjugate transpose of a complex input matrix. Function 3 comes from signal processing: it computes the convolution of two matrices, implementing the definition of 2D convolution directly. Finally, function 4 solves a system of ordinary differential equations arising from the well-known N-body problem. The classical 4th-order Runge-Kutta algorithm was used to solve the system. Given the equation

    Y'(t) = F(t, Y),  Y(t0) = Y0,

the value at the next point is computed as

    Y_{n+1} = Y_n + (h/6) * (K1 + 2*K2 + 2*K3 + K4),

where

    K1 = F(t, Y),
    K2 = F(t + h/2, Y + (h/2)*K1),
    K3 = F(t + h/2, Y + (h/2)*K2),
    K4 = F(t + h,   Y + h*K3).

All four algorithms were first implemented in serial C++ and then converted into parallel form using the OpenMP, ConcRT and TBB technologies. Tests were run on both Windows 7 and Windows HPC Server 2008. The Windows 7 machine has a quad-core Intel Xeon E5440 2.83 GHz processor, a 64-bit OS and 4 GB of RAM. The Windows HPC Server 2008 machine has a quad-core Intel Xeon E5110 1.60 GHz processor, a 64-bit OS and 4 GB of RAM. All routines were built with the VS2010 compiler; the OpenMP 2.0 and TBB 3.0 libraries were used to implement the parallel versions of the programs. All tests were run 10 or more times; execution times were measured with the GetTickCount() routine, and the average execution time was used to compute the speed-ups reported here. All comparative charts are placed in Appendices 1-3.

Brief review of technologies

The Concurrency Runtime (ConcRT) technology includes two libraries: the Parallel Patterns Library (PPL) and the Asynchronous Agents Library. The PPL provides general-purpose containers and algorithms for performing fine-grained parallelism. It enables imperative data parallelism by providing parallel algorithms that distribute computations on collections or sets of data across computing resources, and task parallelism by providing task objects that distribute multiple independent operations across computing resources. The Asynchronous Agents Library (or just Agents Library) provides an actor-based programming model and message-passing interfaces for coarse-grained dataflow and pipelining tasks. Asynchronous agents let you make productive use of latency by performing work while other components wait for data. Both the PPL and the Agents Library ship with Visual Studio 2010 (VS 2010).

OpenMP uses the fork-join model of parallel execution. Although this fork-join model can be useful for solving a variety of problems, it is somewhat tailored to large array-based applications. OpenMP is intended to support programs that execute correctly both as parallel programs (multiple threads of execution and a full OpenMP support library) and as sequential programs (directives ignored and a simple OpenMP stubs library). MS OpenMP v2.0 is built into VS 2010. OpenMP is free to use, and a Fortran version of OpenMP also exists.

Intel Threading Building Blocks (TBB) is a runtime-based parallel programming model for C++ code that uses threads. It consists of a template-based runtime library that helps you harness the latent performance of multicore processors. TBB is used to write scalable applications that:
- specify logical parallel structure instead of threads,
- emphasize data-parallel programming,
- take advantage of concurrent collections and parallel algorithms.

Both an open-source version and a commercial version of TBB are available.

Summarizing

For most mathematical functions, parallelism is achieved by executing loops in parallel or by executing independent tasks simultaneously. These mechanisms are available in all of the examined parallel technologies. After analyzing the performance results, we reached the following conclusions.

OpenMP showed the biggest speed-up on both Windows 7 and Windows HPC Server 2008. It also showed the best scalability with data size, i.e. the same code works equally well on data of different sizes. The code changes needed to convert serial code into parallel code were minimal compared to ConcRT and TBB. OpenMP can be used in both C and C++ projects. If an OpenMP program executes on a single-core machine, the slowdown is small.

The TBB library demonstrated almost the same speed-up as OpenMP, and its scalability with data size also looks good. However, the code changes needed to convert serial code into parallel code are significantly bigger than for OpenMP: when parallelizing a for-loop, the loop body must be wrapped in a function object or a lambda function (lambda support is present in TBB v3.0). TBB can be used only in C++ projects. The slowdown is small if a TBB application is executed on a single-core machine.

ConcRT was the slowest of all the tested technologies. The code changes needed to convert serial code into parallel code are only slightly smaller than for TBB and much bigger than for OpenMP. ConcRT can be used only in C++ projects. Note, however, that only the PPL was used when implementing the parallel versions of the chosen mathematical algorithms; ConcRT has a much wider range of use than OpenMP and TBB, which the chosen mathematical problems cannot demonstrate. In any case, all these technologies can be used simultaneously, so one project can mix TBB, OpenMP and ConcRT.

My opinion is that, for mathematical algorithms, OpenMP is the best technology.

Appendix 1

Below are the results of the execution-time measurements of the ConcRT, OpenMP and TBB programs on the Windows 7 machine using 1, 2, 3 and 4 cores.

[Chart] Speed-up of the ConcRT technology compared with C++, Windows 7
[Chart] Speed-up of the TBB technology compared with C++, Windows 7
[Chart] Speed-up of the OpenMP technology compared with C++, Windows 7

Appendix 2

Below are the results of the execution-time measurements of the ConcRT, OpenMP and TBB programs on the Windows HPC Server 2008 machine using 1, 2, 3 and 4 cores.

[Chart] Speed-up of the ConcRT technology compared with C++, Windows HPC Server 2008
[Chart] Speed-up of the TBB technology compared with C++, Windows HPC Server 2008
[Chart] Speed-up of the OpenMP technology compared with C++, Windows HPC Server 2008

Appendix 3

Below are the speed-ups of the parallel applications compared to their serial equivalents for different sizes of the input parameters. The applications were tested on the Windows 7 machine using 4 cores.

[Chart] Speed-up of the parallel technologies for computing the determinant of matrices of different sizes
[Chart] Speed-up of the parallel technologies for the conjugate transpose of matrices of different sizes

[Chart] Speed-up of the parallel technologies for the convolution of matrices of different sizes
[Chart] Speed-up of the parallel technologies for solving the N-body problem for different numbers of bodies