Design Challenges for Scalable Concurrent Data Structures for Many-Core Processors

DIMACS, March 15th, 2011. Philippas Tsigas

Data Structures in Many-Core Systems
- Decomposition
- Synchronization
- Load balancing
- Inter-task communication

Concurrent Data Structures: Our NOBLE Library Project
Fundamental shared data structures:
- Stacks, queues, deques, priority queues
- Dictionaries, linked lists, snapshots
- Memory management: memory allocation, memory reclamation (garbage collection)
- Atomic primitives
- Single-word and multi-word transactions

Many-core?
There is no clear definition, but at least more than 10 cores; some say thousands. Dual-, quad-, hexa-, octo-, dekaexi-? Trianta dyo-? Many-core!
The most commonly available many-core platforms are medium- to high-end graphics processors: they have up to 30 multiprocessors and are available at low cost. CUDA (a parallel computing platform developed by NVIDIA) and OpenCL (a framework for parallel computing) have made them easily accessible.

A Basic Comparison
Normal processors:
- Large cache
- Few threads
Graphics processors:
- Small or no cache
- SIMD execution
- Wide and fast memory bus, with memory operations that can be coalesced
- Thousands of threads mask memory latency

CUDA System Model
[Figure: a global memory shared by all multiprocessors; each multiprocessor (0, 1, ...) has its own shared memory and runs one or more thread blocks (0, 1, ..., x, y)]

CUDA System Model (across hardware generations)
- Atomic primitives: none -> for global memory -> for shared memory
- Threads per multiprocessor: 768 -> 1024 -> 1536
- Shared memory: 16 KB -> 48 KB
- SIMD width: 8 words -> 32 words

Locks are not supported, neither in CUDA nor in OpenCL. The fairness of the hardware scheduler is unknown: a thread block holding a lock might, for example, be swapped out indefinitely.

No Fairness Guarantees

    while (atomicCAS(&lock, 0, 1));
    ctr++;
    lock = 0;

The thread holding the lock may never be scheduled again, leaving every other thread spinning forever.

Lock-free Data Structures
Mutual exclusion (semaphores, mutexes, spin-locks, disabling interrupts) protects critical sections, but:
- Locks limit concurrency and cause priority inversion
- Busy waiting: repeated checks to see whether the lock has been released
- Convoying: processes stack up before locks
- Blocking: locks are not composable. All code that accesses a piece of shared state must know and obey the locking convention, regardless of who wrote the code or where it resides.
Lock-freedom is a progress guarantee. In practice it means that:
- A fast process doesn't have to wait for a slow or dead process
- There are no deadlocks
- It has been shown to scale better than blocking approaches
Definition: for all possible executions, at least one concurrent operation will succeed in a finite number of its own steps.

A Lock-free Implementation of a Counter
In this case a non-blocking design is easy. CAS(location, expected, new) atomically compares the location against the expected value, stores the new value on a match, and returns the value it read:

    class Counter {
        int next = 0;

        int getNumber() {
            int t;
            do {
                t = next;
            } while (CAS(&next, t, t + 1) != t);  // atomic compare-and-swap
            return t;
        }
    };

Lock-Free Concurrent Data Structures
[Figure: a doubly linked list with head H, nodes 1, 5, 8, and tail T]
- Doubly linked lists
- Hashtables, dictionaries
- Skiplists
- Queues, priority queues, deques
Lock-free data structures on normal processors: joint work with D. Cederman, A. Gidenstam, Ph. Ha, M. Papatriantafilou, H. Sundell, and Y. Zhang. On graphics processors: joint work with D. Cederman, Ph. Ha, and O. Anshus.

A Basic Comparison (revisited)
Normal processors:
- Large cache
- Few threads
- Atomics (CAS, ...)
Graphics processors:
- Small or no cache
- SIMD execution
- Wide and fast memory bus, with memory operations that can be coalesced
- Thousands of threads mask memory latency
- No atomics -> expensive ones

Lock-free Data Structures Without Atomics? Emulating CAS from Coalesced Memory Access

Aligned-Inconsecutive Words (aiwords)
Memory is aligned in m-unit words, where m is a constant ("m-aiword" for short). A read/write operation accesses an arbitrary non-empty subset of the m units of an aiword; an m-aiwrite is an m-aiword assignment. Alignment restriction: m-aiwords must start at addresses that are multiples of m.
[Figure: two adjacent 8-aiwords covering units 0-15, each touched by an 8-aiwrite that updates a subset of its units]

Binary Consensus Using aiwrites
[Figure: binary consensus (BC) for 4+1 processes p0, p1, p2, p3 versus p4. A writing schema assigns units 0-8 of an aiword: p4 writes the triples [0,4,8], [1,5,8], [2,6,8], [3,7,8] against p0, p1, p2, p3 respectively, and whichever process wrote a unit first ("red wrote first") wins. Composing BC for (p0,p1), then (p0..p3), then (p0..p4) over time yields consensus for all 5 processes.]

Hardware Primitives are Significant for Concurrent Data Structures
[Figure: execution time (0-250) versus number of threads (1, 30, 60) for a CAS-based implementation compared with a memory-access-based one (MemAcc)]

Concurrent Data Structures Need Scalable, Strong Synchronization Primitives
Desired features:
- Scalable
- Universal: powerful enough to support any kind of synchronization (like CAS or LL/SC)
- Feasible: easy to implement in hardware
- Easy to use in algorithmic design

Non-blocking Full/Empty Bit
Joint work with Phuong Ha and Otto Anshus

Non-blocking Full/Empty Bit
- Combinable operations
- Universal
- Feasible: a slight modification of a primitive that has already been implemented in hardware
- Easy to use

A variant of the original FEB (full/empty bit) that always returns a value instead of waiting for a conditional flag.

Test-Flag-and-Set:

    TFAS(x, v) {
        (o, flag_o) ← (x, flag_x);
        if flag_x = false then
            (x, flag_x) ← (v, true);
        end if
        return (o, flag_o);
    }

Original FEB, Store-if-Clear-and-Set:

    SICAS(x, v) {
        wait for flag_x to be false;
        (x, flag_x) ← (v, true);
    }

The remaining operations on the (value, flag) word:

    Load(x) {
        return (x, flag_x);
    }

    Store-And-Clear:
    SAC(x, v) {
        (o, flag_o) ← (x, flag_x);
        (x, flag_x) ← (v, false);
        return (o, flag_o);
    }

    Store-And-Set:
    SAS(x, v) {
        (o, flag_o) ← (x, flag_x);
        (x, flag_x) ← (v, true);
        return (o, flag_o);
    }

Combinability
Key idea: combinability eliminates contention and reduces load.
[Figure: four requests TFAS(x,1), TFAS(x,2), TFAS(x,3), TFAS(x,4) enter a combining network; after pairwise combining only TFAS(x,1) reaches memory, which sets x = 1, and the reply 1 is propagated back to all four requests.]
Note: CAS and LL/SC are not combinable.

Conclusions
- New algorithmic techniques come from the introduction of new hardware features.
- The core algorithmic design did not change when going from general-purpose CPUs to GPUs.
- Optimistic concurrency control works in many-core systems.
- It is hard to derive worst-case guarantees. Should the scheduler be part of the reference model?
- We need to start a discussion with the architects about the abstractions/primitives that we want/need.

PEPPHER: Performance Portability and Programmability for Heterogeneous Many-Core Architectures
This project is part of the portfolio of the G.3 Embedded Systems and Control Unit, Information Society and Media Directorate-General, European Commission. www.peppher.eu
Contract number: 248481. Total cost: EUR 3.44 million. Starting date: 2010-01-01. Duration: 36 months. Copyright 2010 The PEPPHER Consortium.

Project Consortium
- University of Vienna (Coordinator), Austria: Siegfried Benkner, Sabri Pllana, and Jesper Larsson Träff
- Chalmers University, Sweden: Philippas Tsigas
- Codeplay Software Ltd., UK: Andrew Richards
- INRIA, France: Raymond Namyst
- Intel GmbH, Germany: Herbert Cornelius
- Linköping University, Sweden: Christoph Kessler
- Movidius Ltd., Ireland: David Moloney
- Karlsruhe Institute of Technology, Germany: Peter Sanders

Programmability and Performance Portability
Programmability:
- Composition and coordination: design patterns, component-based application design
- Annotations: runtime/scheduling/selection hints, data information
Performance portability:
- Library and compilation: adaptive algorithms and data structures, auto-tunable components, automatic variant generation, variant selection, tuning
- Runtime: criteria-specific scheduling, resource-efficient runtime

Thank You!