Parallel Programming in Distributed Systems Or Distributed Systems in Parallel Programming

Parallel Programming in Distributed Systems Or Distributed Systems in Parallel Programming Philippas Tsigas Chalmers University of Technology Computer Science and Engineering Department Philippas Tsigas

WHY PARALLEL PROGRAMMING IS ESSENTIAL IN DISTRIBUTED SYSTEMS AND NETWORKING Philippas Tsigas 2

How did we get here? Picture from Pat Gelsinger, Intel Developer Forum, Spring 2004 (Pentium at 90W) tsigas@cs.chalmers.se Philippas Tsigas 3

Concurrent Software Becomes Essential
1) Scalability becomes an issue for all software.
2) Modern software development relies on the ability to compose libraries into larger programs.
[Chart: single-core frequency scaling (3, 6, 12, 24 GHz) versus multicore scaling (1, 2, 4, 8 cores at 3 GHz).]
Our work is to help the programmer develop efficient parallel programs and also survive the multicore transition.
Philippas Tsigas 4

DISTRIBUTED APPLICATIONS Philippas Tsigas 5

Distributed Applications Demand Quite High-Level Data Sharing: commercial computing (media and information processing), control computing (on-board flight-control systems). Philippas Tsigas 6

Data Sharing: Gameplay Simulation as an Example
This is the hardest problem: 10,000s of objects, each one containing mutable state, each one updated 30 times per second, each update touching 5-10 other objects. Manual synchronization (shared-state concurrency) is hopelessly intractable here. Solutions?
Slide: Tim Sweeney, CEO Epic Games, POPL 2006
Philippas Tsigas 7

NETWORKING Philippas Tsigas 8

40 multithreaded packet-processing engines
http://www.cisco.com/assets/cdc_content_elements/embedded-video/routers/popup.html
On chip, there are 40 32-bit, 1.2-GHz packet-processing engines. Each engine works on a packet from birth to death within the Aggregation Services Router. Each multithreaded engine handles four threads (each thread handles one packet at a time), so each QuantumFlow Processor chip has the ability to work on 160 packets concurrently.
Philippas Tsigas 9

DATA SHARING Philippas Tsigas 10

Data Sharing: Gameplay Simulation as an Example
This is the hardest problem: 10,000s of objects, each one containing mutable state, each one updated 30 times per second, each update touching 5-10 other objects. Manual synchronization (shared-state concurrency) is hopelessly intractable here. Solutions?
Slide: Tim Sweeney, CEO Epic Games, POPL 2006
Philippas Tsigas 11

Blocking Data Sharing
A typical Counter implementation:

    class Counter {
        int next = 0;
        synchronized int getNumber() {
            int t;
            t = next;
            next = t + 1;
            return t;
        }
    }

[Timeline: next goes 0 -> 1 -> 2. Thread 1 calls getNumber(), acquires the lock, reads t = 0, and gets result 0; once the lock is released, Thread 2 calls getNumber() and gets result 1.]
tsigas@cs.chalmers.se Philippas Tsigas 12

Do we need Synchronization? What can go wrong here?

    class Counter {
        int next = 0;
        int getNumber() {
            int t;
            t = next;
            next = t + 1;
            return t;
        }
    }

[Timeline: Thread 1 and Thread 2 both call getNumber(), both read t = 0, and both get result 0; one of the two increments is lost and next ends up at 1 instead of 2.]
tsigas@cs.chalmers.se Philippas Tsigas 13

Blocking Synchronization = Sequential Behavior Philippas Tsigas 14

Blocking Synchronization -> Priority Inversion
A high-priority task is delayed because a low-priority task holds a shared resource. The low-priority task is in turn delayed because a medium-priority task is executing.
Solutions: priority inheritance protocols. They work OK for single processors, but for multiple processors...
[Timeline: tasks H, M, and L.]
Philippas Tsigas 15

Critical Sections + Multiprocessors
Reduced parallelism: several tasks with overlapping critical sections will cause waiting processors to go idle.
[Timeline: tasks 1-4 with overlapping critical sections.]
Philippas Tsigas 16

The BIGGEST Problem with Locks? Blocking. Locks are not composable: all code that accesses a piece of shared state must know and obey the locking convention, regardless of who wrote the code or where it resides. Philippas Tsigas 17

Interprocess Synchronization = Data Sharing
Synchronization is required for concurrency. Mutual exclusion (semaphores, mutexes, spin-locks, disabling interrupts) protects critical sections.
- Locks limit concurrency
- Busy waiting: repeated checks to see whether the lock has been released
- Convoying: processes stack up before locks
- Blocking
- Locks are not composable: all code that accesses a piece of shared state must know and obey the locking convention, regardless of who wrote the code or where it resides.
A better approach is not to lock. 18

A Lock-free Implementation tsigas@cs.chalmers.se Philippas Tsigas 19
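The transcription does not capture the code on this slide; the following is a minimal sketch (not the slide's actual code) of a lock-free counter in Java, replacing the synchronized getNumber() above with a compare-and-swap retry loop:

    import java.util.concurrent.atomic.AtomicInteger;

    class LockFreeCounter {
        private final AtomicInteger next = new AtomicInteger(0);

        int getNumber() {
            while (true) {
                int t = next.get();                   // read the current value
                if (next.compareAndSet(t, t + 1)) {   // succeeds only if next is still t
                    return t;                         // this thread owns ticket t
                }
                // CAS failed: another thread took ticket t; retry with the new value
            }
        }
    }

A failed CAS means some other thread's CAS succeeded, so the counter as a whole always makes progress without any thread ever holding a lock.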

How did it start? Synchronization is an enforcing mechanism used to impose constraints on the order of execution of threads. ... Synchronization is used to coordinate threads' execution and manage shared data. Does it have to be like that? When we share data, do we have to impose constraints on the execution of threads?

HOW SAFE IS IT: LET US START FROM THE BEGINNING Philippas Tsigas 21

Shared Abstract Data Types
An object in memory that:
- Supports some set of operations (ADT)
- Allows concurrent access by many processes/threads
- Is useful to, e.g., exchange data between threads and coordinate thread activities
[Figure: processes P1-P4 invoking operations Op A and Op B on a shared object.] 22

Executing Operations
[Figure: processes P1, P2, and P3 each issuing an invocation and receiving a response. Borrowed from H. Attiya.] 23

Interleaving Operations Concurrent execution 24

Interleaving Operations (External) behavior 25

Interleaving Operations, or Not Sequential execution 26

Interleaving Operations, or Not
Sequential behavior: invocations & responses alternate and match (on process & object).
Sequential specification: all the legal sequential behaviors satisfying the semantics of the ADT.
- E.g., for a (LIFO) stack: pop returns the last item pushed
27

Correctness: Sequential consistency For every concurrent execution there is a sequential execution that - Contains the same operations - Is legal (obeys the sequential specification) - Preserves the order of operations by the same process [Lamport, 1979] 28

Sequential Consistency: Examples
[Figure: a concurrent (LIFO) stack execution with push(7), push(4), and pop() returning 4, shown together with the ordering push(4), push(7), pop():4.] 29

Sequential Consistency: Examples
[Figure: a concurrent (LIFO) stack execution with push(7), push(4), and pop() returning 7.] 30

Safety: Linearizability
Linearizable ADTs:
- The sequential specification defines the legal sequential executions
- Concurrent operations are allowed to be interleaved
- Operations appear to execute atomically: the external observer gets the illusion that each operation takes effect instantaneously at some point between its invocation and its response (this preserves the real-time order of non-overlapping operations)
[Timeline: threads T1 and T2 on a concurrent LIFO stack, with push(4), push(7), and pop() returning 4.]
Philippas Tsigas 31

Safety II An accessible node is never freed. 32

Liveness
Non-blocking implementations:
- Wait-free implementation of an ADT [Lamport, 1977]: every operation finishes in a finite number of its own steps.
- Lock-free (FREE of LOCKS) implementation [Lamport, 1977]: at least one operation (from a set of concurrent operations) finishes in a finite number of steps; the data structure as a system always makes progress.
33
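As a sketch of the lock-free progress guarantee (this example is not from the slides), a Treiber-style stack push retries a CAS on the top pointer; an individual thread may retry indefinitely, but every failed CAS implies that some other operation succeeded, so the stack is lock-free rather than wait-free:

    import java.util.concurrent.atomic.AtomicReference;

    class LockFreeStack<T> {
        private static class Node<T> {
            final T value;
            Node<T> next;
            Node(T value) { this.value = value; }
        }

        private final AtomicReference<Node<T>> top = new AtomicReference<>();

        void push(T value) {
            Node<T> node = new Node<>(value);
            while (true) {
                Node<T> oldTop = top.get();
                node.next = oldTop;                     // link the new node above the current top
                if (top.compareAndSet(oldTop, node)) {  // the successful CAS is the linearization point
                    return;
                }
                // CAS failed: top changed because another push or pop succeeded; retry
            }
        }
    }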

Liveness II Every garbage node is eventually collected. 34

Abstract Data Types (ADTs)
- Cover most concurrent applications, or at least encapsulate their data needs
- An object-oriented programming point of view: an abstract representation of data & a set of methods (operations) for accessing it
[Figure: data together with its signature and specification.] 35

Implementing a High-Level ADT
Using lower-level ADTs & procedures.
[Figure: a high-level object implemented on top of lower-level data objects.] 36

Lower-Level Operations
High-level operations translate into primitives on base objects that are available in H/W.
Obvious: read, write. Common: compare&swap (CAS), LL/SC, FAA.
37
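In Java, for instance, these primitives surface through the java.util.concurrent.atomic classes: CAS and FAA map to compareAndSet and getAndAdd, while LL/SC, where the hardware provides it, is typically used by the runtime underneath rather than exposed directly. A small illustrative sketch with a hypothetical shared counter x:

    import java.util.concurrent.atomic.AtomicLong;

    class BasePrimitives {
        static final AtomicLong x = new AtomicLong(0);

        static void demo() {
            long v = x.get();                          // read
            x.set(42);                                 // write
            boolean swapped = x.compareAndSet(42, 43); // compare&swap (CAS)
            long previous = x.getAndAdd(5);            // fetch-and-add (FAA)
        }
    }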

CAN I FIND A JOB IF I STUDY THIS? Philippas Tsigas 38

8 Feb 2002: Release of NOBLE version 1.0
23 Jan 2002: Expert Group formation (JSR: Java Concurrency Utilities)
8 Jan 2004: JSR first release
29 Aug 2006: Intel's TBB release 1.0

ERLANG OTP_R15A: R15 pre-release
Written by Kenneth, 23 Nov 2011
We have recently pushed a new master to GitHub tagged OTP_R15A. This is a stabilized snapshot of the current R15 development (to be released as R15B on December 14th) which, among other things, includes:
OTP-9468 'Line numbers in exceptions'
OTP-9451 'Parallel make'
OTP-4779 A new GUI for Observer, integrating pman, etop and tv into observer with tracing facilities.
OTP-7775 A number of memory allocation optimizations have been implemented. Most optimizations reduce contention caused by synchronization between threads during allocation and deallocation of memory. Most notably: synchronization of memory management in scheduler-specific allocator instances has been rewritten to use lock-free synchronization; synchronization of memory management in scheduler-specific pre-allocators has been rewritten to use lock-free synchronization; the 'mseg_alloc' memory segment allocator now uses scheduler-specific instances instead of one instance. Apart from reducing contention, this also ensures that memory allocators always create memory segments on the local NUMA node on a NUMA system.
OTP-9632 An ERTS-internal, generic, many-to-one, lock-free queue for communication between threads has been introduced. The many-to-one scenario is very common in ERTS, so it can be used in a lot of places in the future. Currently it is used by scheduling of certain jobs and the async thread pool, but more uses are planned for the future. Drivers using the driver_async functionality are not automatically locked to the system anymore, and can be unloaded as any dynamically linked-in driver. Scheduling of ready async jobs is now also interleaved in between other jobs. Previously all ready async jobs were performed at once.
OTP-9631 The ERTS-internal system block functionality has been replaced by new functionality for blocking the system. The old system block functionality had contention issues and complexity issues. The new functionality piggy-backs on thread progress tracking functionality needed by newly introduced lock-free synchronization in the runtime system. When the functionality for blocking the system isn't used, there is more or less no overhead at all, since the functionality for tracking thread progress is there and needed anyway.
... and much much more. This is not a full release of R15 but rather a pre-release. Feel free to try our R15A release and get back to us with your findings. Your feedback is important to us and highly welcomed.
Regards, The OTP Team
Philippas Tsigas 40

Philippas Tsigas 41

Philippas Tsigas 42

Locks are not supported: not in CUDA, not in OpenCL. The fairness of the hardware scheduler is unknown; a thread block holding a lock might be swapped out indefinitely, for example.

No Fairness Guarantees

    // naive GPU spinlock (CUDA): atomicCAS returns the old value,
    // so a thread spins until it is the one that changes lock from 0 to 1
    while (atomicCAS(&lock, 0, 1))
        ;
    ctr++;      // critical section
    lock = 0;   // release the lock

The thread holding the lock is never scheduled again, so the spinning threads never get past the loop!

Where do we stand?