Concurrent programming: From theory to practice. Concurrent Algorithms 2016 Tudor David
|
|
- Sophie Mason
- 5 years ago
- Views:
Transcription
1 oncurrent programming: From theory to practice oncurrent Agorithms 2016 Tudor David
2 From theory to practice Theoretica (design) Practica (design) Practica (impementation) 2
3 From theory to practice Theoretica (design) Practica (design) Practica (impementation) Impossibiities Upper/Lower bounds Techniques System modes orrectness proofs orrectness Design (pseudo-code) 3
4 From theory to practice Theoretica (design) Practica (design) Practica (impementation) Impossibiities Upper/Lower bounds Techniques System modes orrectness proofs orrectness System modes shared memory message passing Finite memory Practicaity issues re-usabe objects Performance Design (pseudo-code) Design (pseudo-code, prototype) 4
5 From theory to practice Theoretica (design) Practica (design) Practica (impementation) Impossibiities Upper/Lower bounds Techniques System modes orrectness proofs orrectness System modes shared memory message passing Finite memory Practicaity issues re-usabe objects Performance Hardware Which atomic ops Memory consistency ache coherence Locaity Performance Scaabiity Design (pseudo-code) Design (pseudo-code, prototype) Impementation (code) 5
6 Exampe: inked ist impementations Throughput (Mop/s) # ores pessimistic "bad" inked ist "good" inked ist 6
7 Outine PU caches ache coherence Pacement of data Hardware synchronization instructions orrectness: Memory mode & compier Performance: Programming techniques 7
8 Outine PU caches ache coherence Pacement of data Hardware synchronization instructions orrectness: Memory mode & compier Performance: Programming techniques 8
9 Why do we use caching? ore ore freq: 2GHz = 0.5 ns / instr ore Disk = ~ms Disk 9
10 Why do we use caching? ore ore freq: 2GHz = 0.5 ns / instr ore Disk = ~ms ore Memory = ~100ns Memory Disk 10
11 Why do we use caching? ore ache Memory ore freq: 2GHz = 0.5 ns / instr ore Disk = ~ms ore Memory = ~100ns ache - Large = sow - Medium = medium - Sma = fast Disk 11
12 Why do we use caching? ore L3 Memory ore freq: 2GHz = 0.5 ns / instr ore Disk = ~ms ore Memory = ~100ns ache - ore L3 = ~20ns - ore = ~7ns - ore = ~1ns Disk 12
13 Typica server configurations Inte Xeon AMD Opteron GHz GHz - : 32KB - : 64KB - : 256KB - : 512KB - L3: 24MB - L3: 12MB - Memory: 256GB - Memory: 256GB 13
14 Experiment Throughput of accessing some memory, depending on the memory size 14
15 Outine PU caches ache coherence Pacement of data Hardware synchronization instructions orrectness: Memory mode & compier Performance: Programming techniques 15
16 Unti ~2004: Singe-cores ore Memory ore freq: 3+GHz ore Disk ore Memory ache - ore L3 - ore - ore Disk 16
17 After ~2004: Muti-cores ore 0 ore 1 ore freq: ~2GHz ore Disk ore Memory ache L3 - ore shared L3 - ore Memory - ore Disk 17
18 Muti-cores with private caches ore 0 ore 1 Private = mutipe copies L3 Memory Disk 18
19 ache coherence for consistency ore 0 X ore 1 ore 0 has X and ore 1 - wants to write on X - wants to read X - did ore 0 write or read X? L3 Memory Disk 19
20 ache-coherence principes ore 0 ore 1 To perform a write X - invaidate a readers, or - previous writer To perform a read - find the atest copy L3 Memory Disk 20
21 ache coherence with MESI A state diagram State (per cache ine) - Modified: the ony dirty copy - Excusive: the ony cean copy - Shared: a cean copy - Invaid: useess data 21
22 The utimate goa for scaabiity Possibe states - Modified: the ony dirty copy - Excusive: the ony cean copy - Shared: a cean copy - Invaid: useess data Which state is our favorite? 22
23 The utimate goa for scaabiity Possibe states - Modified: the ony dirty copy - Excusive: the ony cean copy -Shared: a cean copy - Invaid: useess data = threads can keep the data cose ( cache) = faster 23
24 Experiment The effects of fase sharing 24
25 Outine PU caches ache coherence Pacement of data Hardware synchronization instructions orrectness: Memory mode & compier Performance: Programming techniques 25
26 Uniformity vs. non-uniformity Typica desktop machine Memory aches = Uniform Typica server machine Memory aches aches Memory = non-uniform (NUMA) 26
27 Latency (ns) to access data Memory Memory L3 L3 27
28 Latency (ns) to access data Memory 1 Memory L3 L3 28
29 Latency (ns) to access data Memory 1 Memory 7 L3 L3 29
30 Latency (ns) to access data Memory 1 Memory 7 20 L3 L3 30
31 Latency (ns) to access data Memory 1 40 Memory 7 20 L3 L3 31
32 Latency (ns) to access data Memory Memory 7 20 L3 L3 32
33 Latency (ns) to access data Memory Memory 7 20 L3 L3 33
34 Latency (ns) to access data Memory Memory 20 L3 L3 34
35 Latency (ns) to access data Memory Memory 20 L3 oncusion: we need to take care of ocaity L3 35
36 Experiment The effects of ocaity 36
37 Outine PU caches ache coherence Pacement of data Hardware synchronization instructions orrectness: Memory mode & compier Performance: Programming techniques 37
38 The Programmer s Toobox: Hardware synchronization instructions Depends on the processor AS generay provided J TAS and atomic increment not aways provided x86 processors (Inte, AMD): Atomic exchange, increment, decrement provided Memory barrier aso avaiabe Inte as of 2014 provides transactiona memory 38
39 Exampe: Atomic ops in G type sync_fetch_and_op(type *ptr, type vaue); type sync_op_and_fetch(type *ptr, type vaue); // OP in {add,sub,or,and,xor,nand} type sync_va_compare_and_swap(type *ptr, type odva, type newva); boo sync_boo_compare_and_swap(type *ptr, type odva, type newva); sync_synchronize(); // memory barrier 39
40 Inte s transactiona synchronization extensions (TSX) 1. Hardware ock eision (HLE) Instruction prefixes: XAQUIRE XRELEASE Exampe (G): he_{acquire,reease}_compare_exchange_n{1,2,4,8} Try to execute critica sections without acquiring/reeasing the ock If confict detected, abort and acquire the ock before re-doing the work 40
41 Inte s transactiona synchronization extensions (TSX) 2. Restricted Transactiona Memory (RTM) _xbegin(); _xabort(); _xtest(); _xend(); Limitations: Not starvation free Transactions can be aborted various reasons Shoud have a non-transactiona back-up Limited transaction size 41
42 Inte s transactiona synchronization extensions (TSX) 2. Restricted Transactiona Memory (RTM) Exampe: if (_xbegin() == _XBEGIN_STARTED){ counter = counter + 1; _xend(); } ese { sync_fetch_and_add(&counter,1); } 42
43 Outine PU caches ache coherence Pacement of data Hardware synchronization instructions orrectness: Memory mode & compier Performance: Programming techniques 43
44 oncurrent agorithm correctness Designing correct concurrent agorithms: 1. Theoretica part 2. Practica part à invoves impementation The processor and the compier optimize assuming no concurrency! L 44
45 The memory consistency mode //A, B shared variabes, initiay 0; //r1, r2 oca variabes; P1 P2 A = 1; B = 1; r1 = B; r2 = A; What vaues can r1 and r2 take? (assume x86 processor) Answer: (0,1), (1,0), (1,1) and (0,0) 45
46 The memory consistency mode à The order in which memory instructions appear to execute What woud the programmer ike to see? Sequentia consistency A operations executed in some sequentia order; Memory operations of each thread in program order; Intuitive, but imits performance; 46
47 The memory consistency mode How can the processor reorder instructions to different memory addresses? x86 (Inte, AMD): TSO variant Reads not reordered w.r.t. reads Writes not reordered w.r.t writes Writes not reordered w.r.t. reads Reads may be reordered w.r.t. writes to different memory addresses //A,B, //gobas int x,y,z; x = A; y = B; B = 3; A = 2; y = A; = 4; z = B; 47
48 The memory consistency mode Singe thread reorderings transparent; Avoid reorderings: memory barriers x86 impicit in atomic ops; voatie in Java; Expensive - use ony when reay necessary; Different processors different memory modes e.g., ARM reaxed memory mode (anything goes!); VMs (e.g. JVM, LR) have their own memory modes; 48
49 Beware of the compier void ock(int * some_ock) { whie (AS(some_ock,0,1)!= 0) {} asm voatie( ::: memory ); //compier barrier } void unock(int * some_ock) { asm voatie( ::: memory ); //compier barrier *some_ock = 0; } voatie int the_ock=0; ock(&the_ock); unock(&the_ock); voatie!= Java voatie The compier can: reorder instructions remove instructions not write vaues to memory 49
50 Outine PU caches ache coherence Pacement of data Hardware synchronization instructions orrectness: Memory mode & compier Performance: Programming techniques 50
51 oncurrent Programming Techniques What techniques can we use to speed up our concurrent appication? Main idea: Minimize contention on cache ines Use case: Locks acquire() = ock() reease() = unock() 51
52 TAS The simpest ock Test-and-Set Lock typedef voatie uint ock_t; void acquire(ock_t * some_ock) { whie (TAS(some_ock)!= 0) {} asm voatie( ::: memory ); } void reease(ock_t * some_ock) { asm voatie( ::: memory ); *some_ock = 0; } 52
53 How good is this ock? A simpe benchmark Have 48 threads continuousy acquire a ock, update some shared data, and unock Measure how many operations we can do in a second Test-and-Set ock: 190K operations/second 53
54 How can we improve things? Avoid cache-ine ping-pong: Test-and-Test-and-Set Lock void acquire(ock_t * some_ock) { whie(1) { whie (*some_ock!= 0) {} if (TAS(some_ock) == 0) { return; } } asm voatie( ::: memory ); } void reease(ock_t * some_ock) { asm voatie( ::: memory ); *some_ock = 0; } 54
55 Performance comparison Ops/second (thousands) Test-and-Set Test-and-Test-and-Set 55
56 But we can do even better Avoid thundering herd: Test-and-Test-and-Set with Back-off void acquire(ock_t * some_ock) { uint backoff = INITIAL_BAKOFF; whie(1) { whie (*some_ock!= 0) {} if (TAS(some_ock) == 0) { return; } ese { ock_seep(backoff); backoff=min(backoff*2,maximum_bakoff); } } asm voatie( ::: memory ); } void reease(ock_t * some_ock) { asm voatie( ::: memory ); *some_ock = 0; } 56
57 Performance comparison Ops/second (thousands) Test-and-Set Test-and-Test-and-Set Test-and-Test-and-Set w. backoff 57
58 Are these ocks fair? Processed requests per thread, Test-and-Set ock Number of processed requests Thread number 58
59 What if we want fairness? Use a FIFO mechanism: Ticket Locks typedef ticket_ock_t { voatie uint head; voatie uint tai; } ticket_ock_t; void acquire(ticket_ock_t * a_ock) { uint my_ticket = fetch_and_inc(&(a_ock->tai)); whie (a_ock->head!= my_ticket) {} asm voatie( ::: memory ); } void reease(ticket_ock_t * a_ock) { asm voatie( ::: memory ); a_ock->head++; } 59
60 What if we want fairness? Processed requests per thread, Ticket Locks Number of processed requests Thread number 60
61 Performance comparison Ops/second (thousands)
62 an we back-off here as we? Yes, we can: Proportiona back-off void acquire(ticket_ock_t * a_ock) { uint my_ticket = fetch_and_inc(&(a_ock->tai)); uint distance, current_ticket; whie (1) { current_ticket = a_ock->head; if (current_ticket == my_ticket) break; distance = my_ticket current_ticket; if (distance > 1) ock_seep(distance * BASE_SLEEP); } asm voatie( ::: memory ); } void reease(ticket_ock_t * a_ock) { asm voatie( ::: memory ); a_ock->head++; } 62
63 Performance comparison Ops/second (thousands)
64 Sti, everyone is spinning on the same variabe. Use a different address for each thread: Queue Locks 1 run eaving 2 run spin 3 spin 4 arriving spin Use with care: 1. storage overheads 2. compexity 64
65 Performance comparison Ops/second (thousands)
66 To summarize on ocks 1. Reading before trying to write 2. Pausing when it s not our turn 3. Ensuring fairness (does not aways bring ++) 4. Accessing disjoint addresses (cache ines) More than 10x performance gain! 66
67 oncusion oncurrent agorithm design Theoretica design Practica design (may be just as important) Impementation You need to know your hardware For correctness For performance 67
Concurrent programming: From theory to practice. Concurrent Algorithms 2015 Vasileios Trigonakis
oncurrent programming: From theory to practice oncurrent Algorithms 2015 Vasileios Trigonakis From theory to practice Theoretical (design) Practical (design) Practical (implementation) 2 From theory to
More informationCSE120 Principles of Operating Systems. Prof Yuanyuan (YY) Zhou Synchronization: Semaphore
CSE120 Principes of Operating Systems Prof Yuanyuan (YY) Zhou Synchronization: Synchronization Needs Two synchronization needs Mutua excusion Whenever mutipe threads access a shared data, you need to worry
More informationCS5460/6460: Operating Systems. Lecture 14: Scalability techniques. Anton Burtsev February, 2014
CS5460/6460: Operating Systems Lecture 14: Scalability techniques Anton Burtsev February, 2014 Recap: read and write barriers void foo(void) { a = 1; smp_wmb(); b = 1; } void bar(void) { while (b == 0)
More informationCSE120 Principles of Operating Systems. Prof Yuanyuan (YY) Zhou Scheduling
CSE120 Principes of Operating Systems Prof Yuanyuan (YY) Zhou Scheduing Announcement Homework 2 due on October 25th Project 1 due on October 26th 2 CSE 120 Scheduing and Deadock Scheduing Overview In discussing
More informationCSE120 Principles of Operating Systems. Prof Yuanyuan (YY) Zhou Midterm Review
CSE120 Principes of Operating Systems Prof Yuanyuan (YY) Zhou Midterm Review Overview The midterm Architectura support for OSes OS modues, interfaces, and structures Processes Threads Synchronization Scheduing
More informationHardware Transactional Memory on Haswell
Hardware Transactional Memory on Haswell Viktor Leis Technische Universität München 1 / 15 Introduction transactional memory is a very elegant programming model transaction { transaction { a = a 10; c
More informationCSE120 Principles of Operating Systems. Prof Yuanyuan (YY) Zhou Lecture 4: Threads
CSE120 Principes of Operating Systems Prof Yuanyuan (YY) Zhou Lecture 4: Threads Announcement Project 0 Due Project 1 out Homework 1 due on Thursday Submit it to Gradescope onine 2 Processes Reca that
More informationThe need for atomicity This code sequence illustrates the need for atomicity. Explain.
Lock Implementations [ 8.1] Recall the three kinds of synchronization from Lecture 6: Point-to-point Lock Performance metrics for lock implementations Uncontended latency Traffic o Time to acquire a lock
More informationCOS 318: Operating Systems. Virtual Memory Design Issues: Paging and Caching. Jaswinder Pal Singh Computer Science Department Princeton University
COS 318: Operating Systems Virtua Memory Design Issues: Paging and Caching Jaswinder Pa Singh Computer Science Department Princeton University (http://www.cs.princeton.edu/courses/cos318/) Virtua Memory:
More informationCSL373: Lecture 5 Deadlocks (no process runnable) + Scheduling (> 1 process runnable)
CSL373: Lecture 5 Deadlocks (no process runnable) + Scheduling (> 1 process runnable) Past & Present Have looked at two constraints: Mutual exclusion constraint between two events is a requirement that
More informationl Tree: set of nodes and directed edges l Parent: source node of directed edge l Child: terminal node of directed edge
Trees & Heaps Week 12 Gaddis: 20 Weiss: 21.1-3 CS 5301 Fa 2016 Ji Seaman 1 Tree: non-recursive definition Tree: set of nodes and directed edges - root: one node is distinguished as the root - Every node
More informationUnderstanding Hardware Transactional Memory
Understanding Hardware Transactional Memory Gil Tene, CTO & co-founder, Azul Systems @giltene 2015 Azul Systems, Inc. Agenda Brief introduction What is Hardware Transactional Memory (HTM)? Cache coherence
More informationIntro to Programming & C Why Program? 1.2 Computer Systems: Hardware and Software. Why Learn to Program?
Intro to Programming & C++ Unit 1 Sections 1.1-3 and 2.1-10, 2.12-13, 2.15-17 CS 1428 Spring 2018 Ji Seaman 1.1 Why Program? Computer programmabe machine designed to foow instructions Program a set of
More informationLecture outline Graphics and Interaction Scan Converting Polygons and Lines. Inside or outside a polygon? Scan conversion.
Lecture outine 433-324 Graphics and Interaction Scan Converting Poygons and Lines Department of Computer Science and Software Engineering The Introduction Scan conversion Scan-ine agorithm Edge coherence
More informationIntro to Programming & C Why Program? 1.2 Computer Systems: Hardware and Software. Hardware Components Illustrated
Intro to Programming & C++ Unit 1 Sections 1.1-3 and 2.1-10, 2.12-13, 2.15-17 CS 1428 Fa 2017 Ji Seaman 1.1 Why Program? Computer programmabe machine designed to foow instructions Program instructions
More informationIntel Transactional Synchronization Extensions (Intel TSX) Linux update. Andi Kleen Intel OTC. Linux Plumbers Sep 2013
Intel Transactional Synchronization Extensions (Intel TSX) Linux update Andi Kleen Intel OTC Linux Plumbers Sep 2013 Elision Elision : the act or an instance of omitting something : omission On blocking
More informationScalable Locking. Adam Belay
Scalable Locking Adam Belay Problem: Locks can ruin performance 12 finds/sec 9 6 Locking overhead dominates 3 0 0 6 12 18 24 30 36 42 48 Cores Problem: Locks can ruin performance the locks
More informationFunctions. 6.1 Modular Programming. 6.2 Defining and Calling Functions. Gaddis: 6.1-5,7-10,13,15-16 and 7.7
Functions Unit 6 Gaddis: 6.1-5,7-10,13,15-16 and 7.7 CS 1428 Spring 2018 Ji Seaman 6.1 Moduar Programming Moduar programming: breaking a program up into smaer, manageabe components (modues) Function: a
More informationA Petrel Plugin for Surface Modeling
A Petre Pugin for Surface Modeing R. M. Hassanpour, S. H. Derakhshan and C. V. Deutsch Structure and thickness uncertainty are important components of any uncertainty study. The exact ocations of the geoogica
More informationNo connection establishment Do not perform Flow control Error control Retransmission Suitable for small request/response scenario E.g.
UDP & TCP 2018/3/26 UDP Header Characteristics of UDP No connection estabishment Do not perform Fow contro Error contro Retransmission Suitabe for sma request/response scenario E.g., DNS Remote Procedure
More informationCS5460: Operating Systems
CS5460: Operating Systems Lecture 9: Implementing Synchronization (Chapter 6) Multiprocessor Memory Models Uniprocessor memory is simple Every load from a location retrieves the last value stored to that
More informationLecture Notes for Chapter 4 Part III. Introduction to Data Mining
Data Mining Cassification: Basic Concepts, Decision Trees, and Mode Evauation Lecture Notes for Chapter 4 Part III Introduction to Data Mining by Tan, Steinbach, Kumar Adapted by Qiang Yang (2010) Tan,Steinbach,
More informationA METHOD FOR GRIDLESS ROUTING OF PRINTED CIRCUIT BOARDS. A. C. Finch, K. J. Mackenzie, G. J. Balsdon, G. Symonds
A METHOD FOR GRIDLESS ROUTING OF PRINTED CIRCUIT BOARDS A C Finch K J Mackenzie G J Basdon G Symonds Raca-Redac Ltd Newtown Tewkesbury Gos Engand ABSTRACT The introduction of fine-ine technoogies to printed
More informationChapter 5: Transactions in Federated Databases
Federated Databases Chapter 5: in Federated Databases Saes R&D Human Resources Kemens Böhm Distributed Data Management: in Federated Databases 1 Kemens Böhm Distributed Data Management: in Federated Databases
More informationShape Analysis with Structural Invariant Checkers
Shape Anaysis with Structura Invariant Checkers Bor-Yuh Evan Chang Xavier Riva George C. Necua University of Caifornia, Berkeey SAS 2007 Exampe: Typestate with shape anaysis Concrete Exampe Abstraction
More informationAgent architectures. Francesco Amigoni
Francesco Amigoni Designing inteigent agents An agent is defined by its agent function f() that maps a sequence of perceptions to an action p(0), p(1),..., p(t) f() a(t) AGENT perceptions actions ENVIRONMENT
More informationAdvance Operating Systems (CS202) Locks Discussion
Advance Operating Systems (CS202) Locks Discussion Threads Locks Spin Locks Array-based Locks MCS Locks Sequential Locks Road Map Threads Global variables and static objects are shared Stored in the static
More informationTransactional Memory. How to do multiple things at once. Benjamin Engel Transactional Memory 1 / 28
Transactional Memory or How to do multiple things at once Benjamin Engel Transactional Memory 1 / 28 Transactional Memory: Architectural Support for Lock-Free Data Structures M. Herlihy, J. Eliot, and
More informationindex.pdf March 17,
index.pdf March 17, 2013 1 ITI 1121. Introduction to omputing II Marce Turcotte Schoo of Eectrica Engineering and omputer Science Linked List (Part 2) Tai pointer ouby inked ist ummy node Version of March
More informationNearest Neighbor Learning
Nearest Neighbor Learning Cassify based on oca simiarity Ranges from simpe nearest neighbor to case-based and anaogica reasoning Use oca information near the current query instance to decide the cassification
More informationECE 172 Digital Systems. Chapter 5 Uniprocessor Data Cache. Herbert G. Mayer, PSU Status 6/10/2018
ECE 172 Digita Systems Chapter 5 Uniprocessor Data Cache Herbert G. Mayer, PSU Status 6/10/2018 1 Syabus UP Caches Cache Design Parameters Effective Time t eff Cache Performance Parameters Repacement Poicies
More informationSynchronization. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University
Synchronization Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Types of Synchronization Mutual Exclusion Locks Event Synchronization Global or group-based
More informationSynchronization. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University
Synchronization Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu SSE3054: Multicore Systems, Spring 2017, Jinkyu Jeong (jinkyu@skku.edu) Types of Synchronization
More informationCSE 120 Principles of Operating Systems
CSE 120 Principles of Operating Systems Spring 2018 Lecture 15: Multicore Geoffrey M. Voelker Multicore Operating Systems We have generally discussed operating systems concepts independent of the number
More informationRDF Objects 1. Alex Barnell Information Infrastructure Laboratory HP Laboratories Bristol HPL November 27 th, 2002*
RDF Objects 1 Aex Barne Information Infrastructure Laboratory HP Laboratories Bristo HPL-2002-315 November 27 th, 2002* E-mai: Andy_Seaborne@hp.hp.com RDF, semantic web, ontoogy, object-oriented datastructures
More informationA Local Optimal Method on DSA Guiding Template Assignment with Redundant/Dummy Via Insertion
A Loca Optima Method on DSA Guiding Tempate Assignment with Redundant/Dummy Via Insertion Xingquan Li 1, Bei Yu 2, Jiani Chen 1, Wenxing Zhu 1, 24th Asia and South Pacific Design T h e p i c Automation
More informationIntroduction to OpenMP
MPSoC Architectures OpenMP Aberto Bosio, Associate Professor UM Microeectronic Departement bosio@irmm.fr Introduction to OpenMP What is OpenMP? Open specification for Muti-Processing Standard API for defining
More informationMultiprocessors II: CC-NUMA DSM. CC-NUMA for Large Systems
Multiprocessors II: CC-NUMA DSM DSM cache coherence the hardware stuff Today s topics: what happens when we lose snooping new issues: global vs. local cache line state enter the directory issues of increasing
More informationComputer Networks. College of Computing. Copyleft 2003~2018
Computer Networks Computer Networks Prof. Lin Weiguo Coege of Computing Copyeft 2003~2018 inwei@cuc.edu.cn http://icourse.cuc.edu.cn/computernetworks/ http://tc.cuc.edu.cn Attention The materias beow are
More informationRWLocks in Erlang/OTP
RWLocks in Erlang/OTP Rickard Green rickard@erlang.org What is an RWLock? Read/Write Lock Write locked Exclusive access for one thread Read locked Multiple reader threads No writer threads RWLocks Used
More informationA Concurrent Skip List Implementation with RTM and HLE
A Concurrent Skip List Implementation with RTM and HLE Fan Gao May 14, 2014 1 Background Semester Performed: Spring, 2014 Instructor: Maurice Herlihy The main idea of my project is to implement a skip
More informationLecture 19: Synchronization. CMU : Parallel Computer Architecture and Programming (Spring 2012)
Lecture 19: Synchronization CMU 15-418: Parallel Computer Architecture and Programming (Spring 2012) Announcements Assignment 4 due tonight at 11:59 PM Synchronization primitives (that we have or will
More informationInvyswell: A HyTM for Haswell RTM. Irina Calciu, Justin Gottschlich, Tatiana Shpeisman, Gilles Pokam, Maurice Herlihy
Invyswell: A HyTM for Haswell RTM Irina Calciu, Justin Gottschlich, Tatiana Shpeisman, Gilles Pokam, Maurice Herlihy Multicore Performance Scaling u Problem: Locking u Solution: HTM? u IBM BG/Q, zec12,
More informationMultiJav: A Distributed Shared Memory System Based on Multiple Java Virtual Machines. MultiJav: Introduction
: A Distributed Shared Memory System Based on Multiple Java Virtual Machines X. Chen and V.H. Allan Computer Science Department, Utah State University 1998 : Introduction Built on concurrency supported
More informationLecture 19: Coherence and Synchronization. Topics: synchronization primitives (Sections )
Lecture 19: Coherence and Synchronization Topics: synchronization primitives (Sections 5.4-5.5) 1 Caching Locks Spin lock: to acquire a lock, a process may enter an infinite loop that keeps attempting
More information250P: Computer Systems Architecture. Lecture 14: Synchronization. Anton Burtsev March, 2019
250P: Computer Systems Architecture Lecture 14: Synchronization Anton Burtsev March, 2019 Coherence and Synchronization Topics: synchronization primitives (Sections 5.4-5.5) 2 Constructing Locks Applications
More informationNUMA-aware Reader-Writer Locks. Tom Herold, Marco Lamina NUMA Seminar
04.02.2015 NUMA Seminar Agenda 1. Recap: Locking 2. in NUMA Systems 3. RW 4. Implementations 5. Hands On Why Locking? Parallel tasks access shared resources : Synchronization mechanism in concurrent environments
More informationIntroduction. Coherency vs Consistency. Lec-11. Multi-Threading Concepts: Coherency, Consistency, and Synchronization
Lec-11 Multi-Threading Concepts: Coherency, Consistency, and Synchronization Coherency vs Consistency Memory coherency and consistency are major concerns in the design of shared-memory systems. Consistency
More informationEverything You Always Wanted to Know About Synchronization but Were Afraid to Ask
Everything You Always Wanted to Know About Synchronization but Were Afraid to Ask Tudor David, Rachid Guerraoui and Vasileios Trigonakis Ecole Polytechnique Federale de Lausanne(EPFL) Haksu Lim, Luis,
More informationAs Michi Henning and Steve Vinoski showed 1, calling a remote
Reducing CORBA Ca Latency by Caching and Prefetching Bernd Brügge and Christoph Vismeier Technische Universität München Method ca atency is a major probem in approaches based on object-oriented middeware
More informationLecture 9: Multiprocessor OSs & Synchronization. CSC 469H1F Fall 2006 Angela Demke Brown
Lecture 9: Multiprocessor OSs & Synchronization CSC 469H1F Fall 2006 Angela Demke Brown The Problem Coordinated management of shared resources Resources may be accessed by multiple threads Need to control
More informationOther consistency models
Last time: Symmetric multiprocessing (SMP) Lecture 25: Synchronization primitives Computer Architecture and Systems Programming (252-0061-00) CPU 0 CPU 1 CPU 2 CPU 3 Timothy Roscoe Herbstsemester 2012
More informationMultiprocessors and Locking
Types of Multiprocessors (MPs) Uniform memory-access (UMA) MP Access to all memory occurs at the same speed for all processors. Multiprocessors and Locking COMP9242 2008/S2 Week 12 Part 1 Non-uniform memory-access
More informationStraight-line code (or IPO: Input-Process-Output) If/else & switch. Relational Expressions. Decisions. Sections 4.1-6, , 4.
If/ese & switch Unit 3 Sections 4.1-6, 4.8-12, 4.14-15 CS 1428 Spring 2018 Ji Seaman Straight-ine code (or IPO: Input-Process-Output) So far a of our programs have foowed this basic format: Input some
More informationSearching, Sorting & Analysis
Searching, Sorting & Anaysis Unit 2 Chapter 8 CS 2308 Fa 2018 Ji Seaman 1 Definitions of Search and Sort Search: find a given item in an array, return the index of the item, or -1 if not found. Sort: rearrange
More informationTaking a Virtual Machine Towards Many-Cores. Rickard Green - Patrik Nyblom -
Taking a Virtual Machine Towards Many-Cores Rickard Green - rickard@erlang.org Patrik Nyblom - pan@erlang.org What we all know by now Number of cores per processor AMD Opteron Intel Xeon Sparc Niagara
More informationLock-Free, Wait-Free and Multi-core Programming. Roger Deran boilerbay.com Fast, Efficient Concurrent Maps AirMap
Lock-Free, Wait-Free and Multi-core Programming Roger Deran boilerbay.com Fast, Efficient Concurrent Maps AirMap Lock-Free and Wait-Free Data Structures Overview The Java Maps that use Lock-Free techniques
More informationDistributed Operating Systems
Distributed Operating Systems Synchronization in Parallel Systems Marcus Völp 203 Topics Synchronization Locks Performance Distributed Operating Systems 203 Marcus Völp 2 Overview Introduction Hardware
More informationModule 7: Synchronization Lecture 13: Introduction to Atomic Primitives. The Lecture Contains: Synchronization. Waiting Algorithms.
The Lecture Contains: Synchronization Waiting Algorithms Implementation Hardwired Locks Software Locks Hardware Support Atomic Exchange Test & Set Fetch & op Compare & Swap Traffic of Test & Set Backoff
More informationSynchronization. Coherency protocols guarantee that a reading processor (thread) sees the most current update to shared data.
Synchronization Coherency protocols guarantee that a reading processor (thread) sees the most current update to shared data. Coherency protocols do not: make sure that only one thread accesses shared data
More informationNUMA-Aware Reader-Writer Locks PPoPP 2013
NUMA-Aware Reader-Writer Locks PPoPP 2013 Irina Calciu Brown University Authors Irina Calciu @ Brown University Dave Dice Yossi Lev Victor Luchangco Virendra J. Marathe Nir Shavit @ MIT 2 Cores Chip (node)
More informationPL/SQL, Embedded SQL. Lecture #14 Autumn, Fall, 2001, LRX
PL/SQL, Embedded SQL Lecture #14 Autumn, 2001 Fa, 2001, LRX #14 PL/SQL,Embedded SQL HUST,Wuhan,China 402 PL/SQL Found ony in the Orace SQL processor (sqpus). A compromise between competey procedura programming
More informationffwd: delegation is (much) faster than you think Sepideh Roghanchi, Jakob Eriksson, Nilanjana Basu
ffwd: delegation is (much) faster than you think Sepideh Roghanchi, Jakob Eriksson, Nilanjana Basu int get_seqno() { } return ++seqno; // ~1 Billion ops/s // single-threaded int threadsafe_get_seqno()
More informationA Quest for Predictable Latency Adventures in Java Concurrency. Martin Thompson
A Quest for Predictable Latency Adventures in Java Concurrency Martin Thompson - @mjpt777 If a system does not respond in a timely manner then it is effectively unavailable 1. It s all about the Blocking
More informationCSE120 Principles of Operating Systems. Prof Yuanyuan (YY) Zhou Advanced Memory Management
CSE120 Principes of Operating Systems Prof Yuanyuan (YY) Zhou Advanced Memory Management Advanced Functionaity Now we re going to ook at some advanced functionaity that the OS can provide appications using
More informationSynchronization. Erik Hagersten Uppsala University Sweden. Components of a Synchronization Even. Need to introduce synchronization.
Synchronization sum := thread_create Execution on a sequentially consistent shared-memory machine: Erik Hagersten Uppsala University Sweden while (sum < threshold) sum := sum while + (sum < threshold)
More informationReadme ORACLE HYPERION PROFITABILITY AND COST MANAGEMENT
ORACLE HYPERION PROFITABILITY AND COST MANAGEMENT Reease 11.1.2.4.000 Readme CONTENTS IN BRIEF Purpose... 2 New Features in This Reease... 2 Instaation Information... 2 Supported Patforms... 2 Supported
More informationCSC2/458 Parallel and Distributed Systems Scalable Synchronization
CSC2/458 Parallel and Distributed Systems Scalable Synchronization Sreepathi Pai February 20, 2018 URCS Outline Scalable Locking Barriers Outline Scalable Locking Barriers An alternative lock ticket lock
More informationChapter 5. Multiprocessors and Thread-Level Parallelism
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model
More informationTransactional Memory. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit
Transactional Memory Companion slides for The by Maurice Herlihy & Nir Shavit Our Vision for the Future In this course, we covered. Best practices New and clever ideas And common-sense observations. 2
More informationCS4021/4521 INTRODUCTION
CS4021/4521 Advanced Computer Architecture II Prof Jeremy Jones Rm 4.16 top floor South Leinster St (SLS) jones@scss.tcd.ie South Leinster St CS4021/4521 2018 jones@scss.tcd.ie School of Computer Science
More informationSpinlocks. Spinlocks. Message Systems, Inc. April 8, 2011
Spinlocks Samy Al Bahra Devon H. O Dell Message Systems, Inc. April 8, 2011 Introduction Mutexes A mutex is an object which implements acquire and relinquish operations such that the execution following
More informationHTTP Random Access and Live Resources IETF 100
HTTP Random Access and Live Resources IETF 1 Barbara Stark bs7652@att.com Darshak Thakore d.thakore@cabeabs.com Craig Pratt pratt@acm.org / craig@ecaspia.com Use existing bytes Range Unit with very arge
More informationAdaptive Lock. Madhav Iyengar < >, Nathaniel Jeffries < >
Adaptive Lock Madhav Iyengar < miyengar@andrew.cmu.edu >, Nathaniel Jeffries < njeffrie@andrew.cmu.edu > ABSTRACT Busy wait synchronization, the spinlock, is the primitive at the core of all other synchronization
More informationImplementing Transactional Memory in Kernel space
Implementing Transactional Memory in Kernel space Breno Leitão Linux Technology Center leitao@debian.org leitao@freebsd.org Problem Statement Mutual exclusion concurrency control (Lock) Locks type: Semaphore
More informationLogTM: Log-Based Transactional Memory
LogTM: Log-Based Transactional Memory Kevin E. Moore, Jayaram Bobba, Michelle J. Moravan, Mark D. Hill, & David A. Wood 12th International Symposium on High Performance Computer Architecture () 26 Mulitfacet
More informationA Memory Grouping Method for Sharing Memory BIST Logic
A Memory Grouping Method for Sharing Memory BIST Logic Masahide Miyazai, Tomoazu Yoneda, and Hideo Fuiwara Graduate Schoo of Information Science, Nara Institute of Science and Technoogy (NAIST), 8916-5
More informationUnit 12: Memory Consistency Models. Includes slides originally developed by Prof. Amir Roth
Unit 12: Memory Consistency Models Includes slides originally developed by Prof. Amir Roth 1 Example #1 int x = 0;! int y = 0;! thread 1 y = 1;! thread 2 int t1 = x;! x = 1;! int t2 = y;! print(t1,t2)!
More informationA Comparison of a Second-Order versus a Fourth- Order Laplacian Operator in the Multigrid Algorithm
A Comparison of a Second-Order versus a Fourth- Order Lapacian Operator in the Mutigrid Agorithm Kaushik Datta (kdatta@cs.berkeey.edu Math Project May 9, 003 Abstract In this paper, the mutigrid agorithm
More informationDesign of IP Networks with End-to. to- End Performance Guarantees
Design of IP Networks with End-to to- End Performance Guarantees Irena Atov and Richard J. Harris* ( Swinburne University of Technoogy & *Massey University) Presentation Outine Introduction Mutiservice
More informationFast and Scalable Queue-Based Resource Allocation Lock on Shared-Memory Multiprocessors
Fast and Scalable Queue-Based Resource Allocation Lock on Shared-Memory Multiprocessors Deli Zhang, Brendan Lynch, and Damian Dechev University of Central Florida, Orlando, USA April 27, 2016 Mutual Exclusion
More informationModels of concurrency & synchronization algorithms
Models of concurrency & synchronization algorithms Lecture 3 of TDA383/DIT390 (Concurrent Programming) Carlo A. Furia Chalmers University of Technology University of Gothenburg SP3 2016/2017 Today s menu
More informationFast and Scalable Queue-Based Resource Allocation Lock on Shared-Memory Multiprocessors
Background Fast and Scalable Queue-Based Resource Allocation Lock on Shared-Memory Multiprocessors Deli Zhang, Brendan Lynch, and Damian Dechev University of Central Florida, Orlando, USA December 18,
More informationSearching & Sorting. Definitions of Search and Sort. Other forms of Linear Search. Linear Search. Week 11. Gaddis: 8, 19.6,19.8.
Searching & Sorting Week 11 Gaddis: 8, 19.6,19.8 CS 5301 Fa 2017 Ji Seaman 1 Definitions of Search and Sort Search: find a given item in a ist, return the position of the item, or -1 if not found. Sort:
More informationPerformance Improvement via Always-Abort HTM
1 Performance Improvement via Always-Abort HTM Joseph Izraelevitz* Lingxiang Xiang Michael L. Scott* *Department of Computer Science University of Rochester {jhi1,scott}@cs.rochester.edu Parallel Computing
More informationApplication of Intelligence Based Genetic Algorithm for Job Sequencing Problem on Parallel Mixed-Model Assembly Line
American J. of Engineering and Appied Sciences 3 (): 5-24, 200 ISSN 94-7020 200 Science Pubications Appication of Inteigence Based Genetic Agorithm for Job Sequencing Probem on Parae Mixed-Mode Assemby
More informationVMM Emulation of Intel Hardware Transactional Memory
VMM Emulation of Intel Hardware Transactional Memory Maciej Swiech, Kyle Hale, Peter Dinda Northwestern University V3VEE Project www.v3vee.org Hobbes Project 1 What will we talk about? We added the capability
More informationAuthorization of a QoS Path based on Generic AAA. Leon Gommans, Cees de Laat, Bas van Oudenaarde, Arie Taal
Abstract Authorization of a QoS Path based on Generic Leon Gommans, Cees de Laat, Bas van Oudenaarde, Arie Taa Advanced Internet Research Group, Department of Computer Science, University of Amsterdam.
More informationJava Semaphore (Part 1)
Java Semaphore (Part 1) Douglas C. Schmidt d.schmidt@vanderbilt.edu www.dre.vanderbilt.edu/~schmidt Institute for Software Integrated Systems Vanderbilt University Nashville, Tennessee, USA Learning Objectives
More informationSynchronization COMPSCI 386
Synchronization COMPSCI 386 Obvious? // push an item onto the stack while (top == SIZE) ; stack[top++] = item; // pop an item off the stack while (top == 0) ; item = stack[top--]; PRODUCER CONSUMER Suppose
More informationMultiprocessors. Flynn Taxonomy. Classifying Multiprocessors. why would you want a multiprocessor? more is better? Cache Cache Cache.
Multiprocessors why would you want a multiprocessor? Multiprocessors and Multithreading more is better? Cache Cache Cache Classifying Multiprocessors Flynn Taxonomy Flynn Taxonomy Interconnection Network
More informationAlgorithms for Scalable Synchronization on Shared Memory Multiprocessors by John M. Mellor Crummey Michael L. Scott
Algorithms for Scalable Synchronization on Shared Memory Multiprocessors by John M. Mellor Crummey Michael L. Scott Presentation by Joe Izraelevitz Tim Kopp Synchronization Primitives Spin Locks Used for
More informationCPSC/ECE 3220 Fall 2017 Exam Give the definition (note: not the roles) for an operating system as stated in the textbook. (2 pts.
CPSC/ECE 3220 Fall 2017 Exam 1 Name: 1. Give the definition (note: not the roles) for an operating system as stated in the textbook. (2 pts.) Referee / Illusionist / Glue. Circle only one of R, I, or G.
More informationArrays. Array Data Type. Array - Memory Layout. Array Terminology. Gaddis: 7.1-3,5
Arrays Unit 5 Gaddis: 7.1-3,5 CS 1428 Spring 2018 Ji Seaman Array Data Type Array: a variabe that contains mutipe vaues of the same type. Vaues are stored consecutivey in memory. An array variabe decaration
More informationBridge Talk Release Notes for Meeting Exchange 5.0
Bridge Tak Reease Notes for Meeting Exchange 5.0 This document ists new product features, issues resoved since the previous reease, and current operationa issues. New Features This section provides a brief
More informationChapter 5. Multiprocessors and Thread-Level Parallelism
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model
More informationSpace-Time Trade-offs.
Space-Time Trade-offs. Chethan Kamath 03.07.2017 1 Motivation An important question in the study of computation is how to best use the registers in a CPU. In most cases, the amount of registers avaiabe
More informationRole of Synchronization. CS 258 Parallel Computer Architecture Lecture 23. Hardware-Software Trade-offs in Synchronization and Data Layout
CS 28 Parallel Computer Architecture Lecture 23 Hardware-Software Trade-offs in Synchronization and Data Layout April 21, 2008 Prof John D. Kubiatowicz http://www.cs.berkeley.edu/~kubitron/cs28 Role of
More informationWelcome - CSC 301. CSC 301- Foundations of Programming Languages
Wecome - CSC 301 CSC 301- Foundations of Programming Languages Instructor: Dr. Lutz Hame Emai: hame@cs.uri.edu Office: Tyer, Rm 251 Office Hours: TBA TA: TBA Assignments Assignment #0: Downoad & Read Syabus
More informationCSE506: Operating Systems CSE 506: Operating Systems
CSE 506: Operating Systems Disk Scheduling Key to Disk Performance Don t access the disk Whenever possible Cache contents in memory Most accesses hit in the block cache Prefetch blocks into block cache
More information