Introduction to Multicore Programming

Minsoo Ryu, Department of Computer Science and Engineering

2 Outline: 1. Multithreaded Programming; 2. Automatic Parallelization and OpenMP; 3. GPGPU

Multithreaded Programming

4 Processes. A process is a program in execution: the unit of execution and the unit of scheduling. A process includes a PID (process id, a non-negative integer), an address space (code, data, stack, heap) that is not accessible from other processes, a state (ready, running, waiting, ...), and its current activity (the current values of the processor registers).

5 Threads. A thread is a unit of concurrent execution and can also be viewed as the unit of CPU scheduling; it is sometimes called a lightweight process. A thread has a thread ID, a program counter, a register set, and a stack. POSIX (Portable Operating System Interface) is a family of standards specified by the IEEE; POSIX threads, or Pthreads, is the POSIX standard for threads.

6 Single- vs. Multi-Threaded Application.

7 Creating a New Thread.
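
The slide's code listing was not captured in the transcription; a minimal sketch of thread creation with pthread_create(), assuming a thread routine named work() and compiled with something like gcc -pthread, could look like this:

    #include <pthread.h>
    #include <stdio.h>

    /* Routine executed by the newly created thread. */
    void *work(void *arg)
    {
        printf("Hello from the new thread\n");
        return NULL;
    }

    int main(void)
    {
        pthread_t tid;

        /* Create the thread; it starts running work() concurrently. */
        if (pthread_create(&tid, NULL, work, NULL) != 0) {
            fprintf(stderr, "pthread_create failed\n");
            return 1;
        }

        /* Wait for it so main() does not exit first. */
        pthread_join(tid, NULL);
        return 0;
    }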

8 Passing and Reading Data. Passing a value into a created thread, and reading the parameter passed to the new thread.
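
A sketch of one common way to pass a value into a created thread and read it back inside the thread: hand pthread_create() the address of a variable and cast the void * argument back to its real type (the names here are illustrative, not the slide's original listing):

    #include <pthread.h>
    #include <stdio.h>

    /* The new thread reads the parameter by casting the void * back. */
    void *work(void *arg)
    {
        int value = *(int *)arg;
        printf("Thread received %d\n", value);
        return NULL;
    }

    int main(void)
    {
        pthread_t tid;
        int value = 42;              /* value passed into the created thread */

        pthread_create(&tid, NULL, work, &value);
        pthread_join(tid, NULL);     /* value must outlive the thread */
        return 0;
    }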

9 Waiting for a Thread to Terminate. Waiting is done with the pthread_join() function.
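
A sketch of waiting with pthread_join(), which blocks until the thread terminates and can also collect the thread's return value (the returned constant and names are made up for illustration):

    #include <pthread.h>
    #include <stdio.h>

    void *work(void *arg)
    {
        /* Hand a small result back to whoever joins with this thread. */
        return (void *)(long)123;
    }

    int main(void)
    {
        pthread_t tid;
        void *result;

        pthread_create(&tid, NULL, work, NULL);

        /* Block here until the thread terminates, then pick up its result. */
        pthread_join(tid, &result);
        printf("Thread returned %ld\n", (long)result);
        return 0;
    }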

10 Thread-Private Data. Variables held on the stack are thread-private data, and parameters passed into functions are also thread-private. In the slide's example, both variables a and b are private to a thread.
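
The slide's listing was not captured; a sketch using the variable names a and b that it mentions (the surrounding program is assumed) would be:

    #include <pthread.h>
    #include <stdio.h>

    /* The parameter a and the stack variable b are both held on the
     * calling thread's stack, so each thread has its own private copies. */
    int compute(int a)
    {
        int b = a * a;
        return b;
    }

    void *work(void *arg)
    {
        printf("result = %d\n", compute(*(int *)arg));
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        int x = 3, y = 4;

        pthread_create(&t1, NULL, work, &x);
        pthread_create(&t2, NULL, work, &y);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }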

11 Thread-Local Variables. A thread-local variable is global data that is visible to all routines, but each thread sees only its own private copy of the data. In the slide's example, the variable mydata is local to the thread, so each thread can see a different value for it.
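
A sketch of a thread-local variable using the mydata name from the slide (the __thread storage class is a common compiler extension; C11 spells it _Thread_local):

    #include <pthread.h>
    #include <stdio.h>

    /* Global in scope, but every thread sees its own private copy. */
    __thread int mydata;

    void *work(void *arg)
    {
        mydata = (int)(long)arg;   /* writes only this thread's copy */
        printf("mydata = %d\n", mydata);
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;

        pthread_create(&t1, NULL, work, (void *)1);
        pthread_create(&t2, NULL, work, (void *)2);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }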

12 Example: Multithreading for a Web Server. (Figure: a web server process in user space containing a dispatcher thread, worker threads, and a web page cache, sitting above the kernel and the network connection in kernel space.)

13 Example: Multithreading for a Web Server.

Dispatcher thread:
    while (TRUE) {
        get_next_request(&buf);
        dispatch_work(&buf);
    }

Worker thread:
    while (TRUE) {
        wait_for_work(&buf);
        look_for_page_in_cache(&buf, &page);
        if (page_not_in_cache(&page))
            read_page_from_disk(&buf, &page);
        return_page(&page);
    }

Automatic Parallelization and OpenMP

15 Automatic Parallelization. An ideal compiler would be able to manage everything about parallelization, from identifying the parallel parts to running them in parallel. However, current compilers can only automatically parallelize loops, treating it as just another compiler optimization. Without some help from the developer, compilers will rarely be able to exploit all the available parallelism.

16 Parallelization with Autopar. Automatic parallelization (autopar) with the Solaris Studio C compiler.

17 Parallelization with Autopar. The compiler can parallelize the first loop, but it is not able to parallelize the second because of the function call it contains: there is no standard way to denote that a function call can safely be executed in parallel.
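
The example referenced on these two slides was not captured; a rough sketch of the pattern being described, assuming compilation with the Solaris Studio autopar flags (something like cc -O3 -xautopar -xloopinfo), might be:

    /* update() is defined elsewhere, so the compiler cannot see it. */
    double update(double v);

    void compute(double *a, double *b, int n)
    {
        int i;

        /* First loop: iterations are clearly independent, so the
         * compiler can parallelize it automatically. */
        for (i = 0; i < n; i++)
            a[i] = a[i] + b[i];

        /* Second loop: the call to update() might have side effects,
         * and there is no standard way to tell the compiler the call
         * is safe, so this loop is not parallelized. */
        for (i = 0; i < n; i++)
            a[i] = update(a[i]);
    }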

18 Parallelization with the Intel Compiler. Example code for matrix multiplication.

19 Parallelization with the Intel Compiler. The reason parallelization fails is the possibility of aliasing between the store to out[i] and the values of the loop bounds, *row and *col; the only safe assumption for the compiler is that they may alias. Even in the modified code, the possibility of aliasing among out, mat, and vec remains.
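
The listings on these two slides were not captured; the sketch below reconstructs their general shape from the names out, mat, vec, *row, and *col mentioned above (the function names and signatures are assumptions). In the first version the compiler must assume the store to out[i] might change *row or *col; copying the bounds into locals removes that hazard, but out, mat, and vec may still alias one another unless they are restrict-qualified:

    /* Original form: the loop bounds are re-read through pointers. */
    void matvec(double **mat, double *vec, double *out, int *row, int *col)
    {
        int i, j;
        for (i = 0; i < *row; i++) {
            out[i] = 0.0;
            for (j = 0; j < *col; j++)
                out[i] += mat[i][j] * vec[j];
        }
    }

    /* Modified form: the bounds are copied into local variables. */
    void matvec2(double **mat, double *vec, double *out, int *row, int *col)
    {
        int i, j;
        int rows = *row, cols = *col;

        for (i = 0; i < rows; i++) {
            out[i] = 0.0;
            for (j = 0; j < cols; j++)
                out[i] += mat[i][j] * vec[j];
        }
    }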

20 OpenMP. OpenMP is the most commonly used language extension for parallelism. It defines an API that enables a developer to add directives telling the compiler how to parallelize the application, and it follows a fork-join model. The slide shows the parallelization of a loop with OpenMP.
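
A minimal sketch of parallelizing a loop with an OpenMP directive (the routine and names are illustrative; compiled with an OpenMP-aware compiler, e.g. gcc -fopenmp):

    /* The pragma asks the compiler to split the loop's iterations
     * among a team of threads; serial compilers simply ignore it. */
    void scale(double *a, double s, int n)
    {
        int i;

        #pragma omp parallel for
        for (i = 0; i < n; i++)
            a[i] = a[i] * s;
    }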

21 Parallel Regions. When a parallel region is encountered, the work is divided between a team of threads. The number of threads is set by the environment variable OMP_NUM_THREADS, or by the application at runtime through calls into the runtime support library.
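
A small sketch showing the two ways of choosing the team size: the OMP_NUM_THREADS environment variable, or a runtime call such as omp_set_num_threads() (the value 4 used here is arbitrary):

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        /* Request four threads at runtime instead of relying on
         * the OMP_NUM_THREADS environment variable. */
        omp_set_num_threads(4);

        #pragma omp parallel
        {
            printf("Thread %d of %d\n",
                   omp_get_thread_num(), omp_get_num_threads());
        }
        return 0;
    }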

22 Variable Scoping. Variables can be scoped in two ways: shared, where every thread accesses the same variable, or private, where each thread gets its own copy. The scoping rules in OpenMP are quite complex, but a simplified summary is: the loop induction variable is private, variables defined inside the parallel code are private, and variables defined outside the parallel region are shared.

23 Explicit Variable Scoping. The simplified rules are adequate in simple situations but may not be appropriate in complex ones; in those cases it is better to define the variable scoping manually.
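
A sketch of manually scoping variables instead of relying on the defaults (the routine and variable names are illustrative):

    /* tmp is declared outside the region and so would be shared
     * under the default rules; here it is scoped explicitly as
     * private, while the arrays and n stay shared. */
    void add(double *a, double *b, double *c, int n)
    {
        int i;
        double tmp;

        #pragma omp parallel for private(i, tmp) shared(a, b, c, n)
        for (i = 0; i < n; i++) {
            tmp = a[i] + b[i];
            c[i] = tmp;
        }
    }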

24 Reductions. Not all variables can be scoped as either shared or private. Can we scope total as shared? What happens if we use a semaphore to serialize access to total?

25 Reductions. OpenMP provides a reduction operation: each thread has a private copy of the reduction variable inside the parallel region, and at the end of the region the private copies are combined to produce the final result. Besides addition, supported operations include subtraction; multiplication; the bitwise operations AND, OR, and XOR; and the logical operations AND and OR.
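
A sketch of the reduction clause applied to the total variable discussed above (the summation loop itself is an assumed example):

    /* Each thread accumulates into its own private copy of total;
     * the copies are added together when the parallel region ends. */
    double sum(double *a, int n)
    {
        double total = 0.0;
        int i;

        #pragma omp parallel for reduction(+:total)
        for (i = 0; i < n; i++)
            total += a[i];

        return total;
    }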

26 Static Work Scheduling. The default scheduling for a parallel for loop is static scheduling: the iterations are divided evenly, in chunks of consecutive iterations, between the threads. In some cases each iteration performs a different amount of work, so two threads may complete the same number of iterations while one of them has much more work to do in those iterations.

27 Dynamic Work Scheduling. The work is divided into many chunks, and each thread takes the next chunk as soon as it completes its current one.
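
A sketch of requesting dynamic scheduling with a chunk size of 16 (the routine expensive_work() is hypothetical, standing in for any iteration whose cost varies):

    double expensive_work(int i);   /* cost varies from iteration to iteration */

    void process(double *a, int n)
    {
        int i;

        /* Each thread grabs the next chunk of 16 iterations as soon
         * as it finishes its current chunk, balancing uneven work. */
        #pragma omp parallel for schedule(dynamic, 16)
        for (i = 0; i < n; i++)
            a[i] = expensive_work(i);
    }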

GPGPU

29 GPGPU: General-Purpose Computing on GPUs. GPGPU is the use of a GPU to perform computation in applications traditionally handled by the CPU; GPUs can be viewed as compute co-processors. GPGPU technologies include the Compute Unified Device Architecture (CUDA) from Nvidia and the Open Computing Language (OpenCL), supported by ATI and Nvidia and adopted by Apple, Intel, Qualcomm, and others.

30 Two Considerations in GPGPU. Different instruction sets: GPUs do not share the instruction set of the host processor, so the toolchain has to produce code for the host processor together with code for the GPU. Different address spaces: data needs to be copied across to the GPU, the copy is time consuming, and the problem should be large enough to justify the cost of the copy operation.

31 CUDA Processing Flow. A program written in OpenCL would look broadly similar to a CUDA program.

32 Simple CUDA Program.
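
The program listing on this slide was not captured; a minimal CUDA sketch consistent with the description on the next two slides (the element count, block size, and float data type are choices made here for illustration) might look like this, compiled with nvcc:

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    #define N (1024 * 1024)

    /* square() runs on the GPU; every hardware thread executes this
     * same routine on a different element of the array. */
    __global__ void square(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] = data[i] * data[i];
    }

    /* main() runs on the host processor. */
    int main(void)
    {
        size_t size = N * sizeof(float);

        /* Allocate memory on the host and fill it with data. */
        float *host = (float *)malloc(size);
        for (int i = 0; i < N; i++)
            host[i] = (float)i;

        /* Allocate memory on the GPU. */
        float *device;
        cudaMalloc((void **)&device, size);

        /* Copy the data from the host to the GPU. */
        cudaMemcpy(device, host, size, cudaMemcpyHostToDevice);

        /* Threads are arranged in blocks; the launch specifies the
         * number of threads per block and the number of blocks. */
        int threadsPerBlock = 256;
        int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;
        square<<<blocks, threadsPerBlock>>>(device, N);

        /* Once square() completes, copy the results back to the host. */
        cudaMemcpy(host, device, size, cudaMemcpyDeviceToHost);

        cudaFree(device);
        free(host);
        return 0;
    }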

33 Simple CUDA Program. The routine square() is executed by the GPU, and each hardware thread on the GPU executes the same routine. The routine main() is executed by the host processor: it allocates memory on the host and then allocates memory on the GPU.

34 Simple CUDA Program. The data is copied from the host to the GPU; the bus bandwidth may be 8 GB/s to 16 GB/s. Threads are arranged in groups called blocks, and the kernel launch specifies the number of threads per block and the number of blocks. Once the call to square() completes, the host copies the resulting data back from the device into host memory.
