Parallel Ant System on Max Clique problem (using Shared Memory architecture)


In the previous Distributed Ants section, we approached the original Ant System algorithm using distributed computing, partitioning the graph proportionally to the number of 'slaves'. As shown, that method produces promising results; however, it depends on the type of graph and the configuration (i.e., on whether the work, the ant activity, is balanced across all graph regions). (Note to Dr. Bui: I assume that there is a section on Mike's distributed code prior to this section.) In this section, we propose another technique that makes use of readily available multi-processor architectures. This technique enhances the performance of the original version (referred to as the 'sequential version' hereafter) by taking advantage of the shared-memory multiprocessor architecture. We first introduce this parallel model, discuss the parallel techniques applied, analyze the differences between the two versions (parallel vs. sequential), and finally present the results.

Parallel computing with OpenMP

The symmetric multiprocessor (SMP) architecture we use belongs to the shared-memory (tightly coupled) class (Ref 1). This class allows a collection of processors to access the same shared memory. The main benefit of SMP (vs. distributed computing on a cluster) is performance: communication through shared memory is much faster than communication through the distributed memory of a cluster. [1]

[1] Reliability is another benefit, since it is common for nodes in a cluster to stop functioning but very rare for a processor in an SMP machine to fail. However, some argue that this actually favors the cluster: even if nodes go down, the cluster still functions, whereas if a processor in an SMP machine stops working, the whole machine must be taken down to replace the processor.

The standard view of parallelism in shared memory uses the fork/join model (see Figure B). The program starts with a single thread (the master); the master creates slave threads when parallelism is required, and they work together through the parallel region. When the parallel code is done, all threads except the master die (or are suspended), and serial execution continues with the master from that point.

OpenMP (MP stands for Multi Processing) has recently emerged as a standard shared-memory programming model backed by a host of large computer vendors and organizations (e.g., Compaq, HP, Intel, IBM, SGI, Sun, and the U.S. Department of Energy). The OpenMP API consists of a set of compiler directives and a library of support functions. The currently supported programming languages are C, C++, and Fortran. [2] (Ref 3) Nevertheless, this section is not about the technical details of OpenMP but rather about analyzing and designing appropriate parallelism for the existing sequential Ant System algorithm.

[2] All implementations in this section use the Intel C++ Compiler for Linux with OpenMP support.
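For concreteness, the following is a minimal, stand-alone sketch of the fork/join model as expressed in OpenMP (an editorial illustration, not part of the Ant System code): a compiler directive forks a team of threads for one parallel region, library routines identify the threads, and execution joins back to the master afterwards.

    #include <cstdio>
    #include <omp.h>                        // OpenMP library routines

    int main()
    {
        std::printf("master thread runs alone\n");

        #pragma omp parallel                // fork: the master creates a team of threads
        {
            // every thread in the team executes this block concurrently
            std::printf("hello from thread %d of %d\n",
                        omp_get_thread_num(), omp_get_num_threads());
        }                                   // join: slave threads die or are suspended

        std::printf("master thread continues serially\n");
        return 0;
    }

Compiled with OpenMP support enabled (e.g., g++ -fopenmp), the block between the braces runs once per thread, while everything outside it runs only on the master.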

Converting Existing Ant's Sequential Code to Parallel - March Ants in Parallel

    for each cycle {
        for (i = 0 : numants) {                     // for each ant
            ant(i).choosenextmove() {
                ant(i).nextvertex = getnextmoveheuristically();
                for (j = 0 : ant(i).adjedges.size()) {
                    if (edges[ant(i).adjedges[j]].lastupdate < currenttime) {
                        edges[ant(i).adjedges[j]].evaporatepheromone();
                        edges[ant(i).adjedges[j]].lastupdate = currenttime;
                    }
                }
            }
        }
        for (i = 0 : numants) {
            ant(i).moves() {
                deposit a small amount of pheromone on the edge from
                    ant(i).currentvertex to ant(i).nextvertex
                ant(i).currentvertex = ant(i).nextvertex;
                ant(i).nextvertex = null;
            }
        }
    }

    Figure C: sequential version

Figure C shows the ants' activities section of the sequential version, which is the part we parallelize. There are two reasons to apply parallelism here: 1) this region takes the most computing time (more than 50% [3]), and 2) its data dependencies can be dealt with safely, as shown below. [4]

[3] We used a profiler to study the computation, and it shows that this ant-activities loop takes more than 50% of the running time. Furthermore, this percentage increases as the graph becomes more complex (see the explanation of the O(V^2) complexity).

[4] As with any parallel technique, it is important to take care of data dependencies. It is usually data dependencies that prevent a sequential algorithm from being parallelized.

In each cycle, all ants perform two main operations:

1. Each ant decides where to move next via a heuristic calculation (i.e., favoring vertices that hold more ants and whose connecting edges carry more pheromone). This step also evaporates a small amount of pheromone on the edges adjacent to the ant's current vertex.

2. Once all the ants have made their decisions, they move (i.e., update their location configuration, deposit pheromone, etc.).

The first operation is the most time-consuming, since both the heuristic calculation and the pheromone evaporation can be expensive. An ant's decision is based on the pheromone of its adjacent edges, which is bounded by O(V) (the number of vertices) if the ant happens to sit on a vertex connected to all other vertices. The same bound applies to the evaporation operation, since all edges adjacent to the ant's current location must be scanned (to determine whether they should be evaporated). The number of ants is set to V * multiplier (multiplier >= 1), so this section has complexity O(V^2): each ant takes O(V) time and there are at least V ants. It is tempting to parallelize this ant-activities portion immediately by having each processor handle the work of numants/numprocs [5] ants.

[5] OpenMP automatically takes care of the case where numants/numprocs is not a whole number by giving some threads one extra iteration. For example, with numants = 8 and numprocs = 3, two processors handle 3 ants each and the third handles 2.
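As a concrete illustration of that straightforward division (a hedged sketch with a hypothetical choose_next_move() stand-in, not the project code), an OpenMP work-sharing loop over the ants would look like the following; with schedule(static), each thread receives one contiguous block of roughly numants/numthreads iterations, exactly the division described in footnote [5].

    #include <vector>

    int choose_next_move(int ant);   // hypothetical stand-in for ant(i).choosenextmove()

    // Hedged sketch: the naive way to split the ant loop across processors.
    void march_ants_naive(int numants, std::vector<int>& nextvertex)
    {
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < numants; i++) {
            // each thread handles a contiguous block of roughly numants/numthreads ants
            nextvertex[i] = choose_next_move(i);
        }
    }

The problem, as discussed next, is that choosing moves this way lets ants running on different processors see pheromone values that other ants are evaporating in the same cycle.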

However, the (heuristic) decision of each ant depends on the pheromone amount on each edge. That is, ant(i)'s decision at cycle c may be altered by ant(i-1)'s decision at cycle c (because of the evaporation action). We therefore redesigned the algorithm so that all ants make their decisions based on the current information and independently of one another: ant(i) at cycle c decides based on the configuration from cycle c-1, not on the other ants' decisions at cycle c. This redesign is quite a change from the sequential algorithm, but it has no effect on result quality, as shown in the Difference Analysis and Results sections below.

There is another dependency in the sequential code: each edge adjacent to an ant's current location may only be updated once per cycle (i.e., per currenttime). If this were done in parallel, ants currently at the same location might update the same information (evaporate pheromone) concurrently, causing a dirty read/write. Fortunately this is easy to deal with: the decision of which edges to update is made in parallel, but the actual updating is done separately, after the ant loop. [6] We created a shared array bool edgetoupdate[numedges] that is reset to all false in each cycle. When an edge e adjacent to ant(i)'s current location is to be updated (i.e., if its lastupdate < currenttime), edgetoupdate[e] is set to true. Multiple ants (on different processors) can simultaneously write true to the same array index, but that does not affect any decision, since an edge only needs to be updated once. After all the decisions have been made (i.e., the for-ant loop is done), edgetoupdate[] is scanned, and each edge whose index is marked true is updated and then reset to false (ready for the next cycle). To enhance performance further, we also parallelize this scanning loop when the number of edges is large enough (it can be quite large depending on the graph). The parallel version is presented in Figure D below.

[6] Note that we could use OpenMP's critical-region locking (i.e., lock the data being updated); however, experiments show that this approach costs too much overhead due to the continuous locking and releasing.

    bool edgetoupdate[numedges];  // shared variable, declared and initialized to all false somewhere above

    for each cycle {
        // parallel region, master forks threads and they work concurrently
        for (i = 0 : numants) {                     // for each ant
            ant(i).choosenextmove() {
                ant(i).nextvertex = getnextmoveheuristically();
                for (j = 0 : ant(i).adjedges.size()) {
                    if (edgetoupdate[ant(i).adjedges[j]] == false &&
                        edges[ant(i).adjedges[j]].lastupdate < currenttime) {
                        edgetoupdate[ant(i).adjedges[j]] = true;
                    }
                }
            }
        }
        // end parallel region, master thread continues, other threads suspended

        // parallel region, master forks threads and they work concurrently
        for (e = 0 : numedges) {
            if (edgetoupdate[e] == true) {
                edges[e].evaporatepheromone();
                edges[e].lastupdate = currenttime;
                edgetoupdate[e] = false;            // reset
            }
        }
        // end parallel region, master thread continues, other threads suspended

        for (i = 0 : numants) {
            ant(i).moves() {
                deposit a small amount of pheromone on the edge from
                    ant(i).currentvertex to ant(i).nextvertex
                ant(i).currentvertex = ant(i).nextvertex;
                ant(i).nextvertex = null;
            }
        }
    }

    Figure D: parallel version
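In actual OpenMP terms, the two parallel regions of Figure D might be written roughly as follows (a hedged sketch with simplified Ant and Edge structures and an assumed evaporation rate rho, not the exact project code). A plain char array is used for the shared flags so that concurrent writes of true by different threads are benign; every writer stores the same value, and each marked edge is then evaporated by exactly one iteration of the second loop.

    #include <vector>

    struct Edge { double pheromone; int lastupdate; };
    struct Ant  { int currentvertex; std::vector<int> adjedges; };

    // Hedged sketch of Figure D: mark the edges to evaporate in parallel,
    // then apply the evaporation in a second work-shared loop over the edges.
    void evaporation_step(std::vector<Ant>& ants, std::vector<Edge>& edges,
                          std::vector<char>& edgetoupdate,   // shared flags, all 0 at cycle start
                          int currenttime, double rho)       // rho: assumed evaporation rate
    {
        // parallel region 1: the ant loop only records which edges need updating
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < (int)ants.size(); i++) {
            for (int j = 0; j < (int)ants[i].adjedges.size(); j++) {
                int e = ants[i].adjedges[j];
                if (!edgetoupdate[e] && edges[e].lastupdate < currenttime)
                    edgetoupdate[e] = 1;                     // duplicate writes of 1 are harmless
            }
        }

        // parallel region 2: each marked edge is updated by exactly one iteration,
        // so no locking is required
        #pragma omp parallel for schedule(static)
        for (int e = 0; e < (int)edges.size(); e++) {
            if (edgetoupdate[e]) {
                edges[e].pheromone *= (1.0 - rho);           // stand-in for evaporatepheromone()
                edges[e].lastupdate = currenttime;
                edgetoupdate[e] = 0;                         // reset for the next cycle
            }
        }
    }

Whether the second loop is actually worth running in parallel depends on the number of edges, which is the judgment discussed in the next subsection.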

Other sections in parallel

One of the greatest benefits of shared-memory programming is incremental parallelization, which lets the fork/join operations be applied wherever they are needed. [7] This makes it very easy (and appealing) to apply parallelism everywhere in the sequential code. However, it is necessary to determine [8] whether the speedup gained exceeds the overhead (e.g., the overhead of the fork/join operations). The march-ants-in-parallel section above shows where parallelism is clearly appropriate and beneficial, because that expensive section appears in almost every Ant System based algorithm. Further optimization can be obtained by applying incremental parallelization to other parts, including the local optimization, large for loops (as shown with the for-all-edges loop), or simply any region whose iterations can be done independently.

[7] Technical details of the fork/join model and its benefits compared with other parallel techniques can be found in Ref 3.

[8] Use an analysis such as Amdahl's law to estimate whether the speedup is worthwhile before attempting to apply parallelism. Of course, the most accurate approach is to profile the program's actual runs and determine the exact CPU/time usage.

The following is an instance in our Max Clique Ant System's local optimization where parallelism gives a noticeable performance improvement. We applied parallelism to the set-score code in the findclique function of the local optimization (see Figure X; Dr. Bui, I assumed Rizzo's section is above this and that it contains the findclique function). This region has complexity O(V^2) because it calculates the score of each vertex, i.e., the pheromone amount on its adjacent edges.
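A hedged sketch of what such a parallel scoring loop could look like is given below (the data layout and names are assumptions, not the actual findclique code): each thread sums the pheromone on the edges adjacent to its share of the vertices, and no synchronization is needed because every thread writes a disjoint set of score entries.

    #include <vector>

    // Hedged sketch (assumed data layout): score[v] = total pheromone on edges adjacent to v.
    void set_scores(const std::vector<std::vector<int> >& adjedges,   // adjedges[v]: edge ids at v
                    const std::vector<double>& pheromone,             // pheromone[e]: pheromone on edge e
                    std::vector<double>& score)
    {
        #pragma omp parallel for schedule(static)
        for (int v = 0; v < (int)adjedges.size(); v++) {
            double s = 0.0;
            for (int j = 0; j < (int)adjedges[v].size(); j++)
                s += pheromone[adjedges[v][j]];
            score[v] = s;       // each vertex (and its score entry) belongs to exactly one thread
        }
    }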

Analysis on the difference of the two algorithms

TO BE DONE

Results

Table A shows the expected result quality: the parallel and sequential versions give similar results. (Question to Dr. Bui: do you want to mention that it performs better in many cases too? I don't think it's a good idea to claim this, since it would lead readers to think the parallel design is the main cause of the better quality; I think it is caused by other factors, e.g., code differences.)

Table A: Results. Columns: Name, Vertices, Edges, Opt, RizzoSol, ParSol, SeqSol, AvgParSol, AvgSeqSol, StdDP, StdDS; one row per benchmark graph (the c-fat, johnson, keller, hamming, san, sanr, brock, p_hat, and MANN_a families).

Graphs A and B below show the speedup [9] obtained with different numbers of processors. [10] The speedup starts at about 1.5 on 2 processors and rises steadily to around 3 on 16 processors for small and simple configurations (e.g., graphs with few vertices and edges that are not too densely connected). We even see speedups of less than 1 (i.e., the parallel version runs slower than the sequential one) on some graphs (e.g., c-fat*). This is as expected, since the overhead is the dominant factor in these cases, where the programs run extremely fast (in under one second).

[9] Speedup = time_sequential / time_parallel. For example, if a program runs in half the time in parallel compared to sequential, the speedup is 2, which on two processors is linear speedup, the best possible and probably never achievable due to parallel overhead.

[10] We use a computer with 16 Intel Itanium processors and 8 GB of RAM, running Red Hat Linux; all programs are compiled with the Intel C/C++ compiler.

However, the excitement comes with the large graphs (e.g., MANN_a45, hamming10-2), where the speedup rises steadily to around 9 on 16 processors and shows no sign of leveling off. With this we achieve the main goal of applying parallelism to the program: increased performance on complex computations.

Graphs A and B: Speedup versus number of processors (p2 through p16) for the benchmark graphs of Table A.

Question to Dr. Bui: do you want graphs? Or tabular data as below? Or is this fine?

Question to Dr. Bui: do you want a conclusion for this section, e.g., a summary recapping the main points and a discussion of further enhancements?

Conclusion: With OpenMP's support for incremental parallelization, we can apply as much parallelism as we like; for example, we can simply parallelize the local optimization part, any loop, or any section that can be parallelized. Of course, we have to analyze whether it is worth it, i.e., whether the gain from the parallel work exceeds the fork/join overhead. Further enhancement: a hybrid approach combining distributed memory with shared memory.
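As a rough illustration of that analysis (an editorial sketch, not a measurement from this project), Amdahl's law bounds the speedup by 1 / ((1 - P) + P/N), where P is the fraction of the run time that is parallelized and N is the number of processors. With the roughly 50% fraction profiled for small graphs, 16 processors can give at most about a 1.9x speedup, while a fraction around 95%, which is plausible for the large dense graphs where the ant loop dominates, allows a bound near 9x, consistent with the measured results.

    #include <cstdio>

    // Hedged illustration: Amdahl's law upper bound on speedup.
    static double amdahl_bound(double P, int N)    // P: parallel fraction, N: processors
    {
        return 1.0 / ((1.0 - P) + P / N);
    }

    int main()
    {
        std::printf("P = 0.50, N = 16: at most %.1fx\n", amdahl_bound(0.50, 16));  // ~1.9x
        std::printf("P = 0.95, N = 16: at most %.1fx\n", amdahl_bound(0.95, 16));  // ~9.1x
        return 0;
    }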

Reference

(Dr. Bui: you can get more information on the books, e.g., ISBNs and complete author names, by searching for the titles on Amazon.)

1) William Stallings, Operating Systems: Internals and Design Principles, Chapter 4: Threads, SMP, and Microkernels.

2) R. Chandra, L. Dagum, D. Kohr, D. Maydan, J. McDonald, and R. Menon, Parallel Programming in OpenMP, Preface.

3) Michael J. Quinn, Parallel Programming in C with MPI and OpenMP, Chapter 17: Shared-Memory Programming.
