Message Passing Improvements to Shared Address Space Thread Synchronization Techniques DAN STAFFORD, ROBERT RELYEA
2 Agenda
- Background
  - Motivation
  - Remote Memory Request
- Shared Address Synchronization
  - Remote Core Locking
  - Combiner Approach
  - Wait-Free Locks
- Limitations of SAS
  - Performance tied to the cache coherency model
- Hybrid Solution with Message Passing
  - Server approach: MP-Server
  - Combiner approach: HybComb
- Performance Evaluation
- Conclusion
3 Motivation
- Application acceleration through many-core processors
- Obstacles to efficient parallelization
  - Artificial communication
  - Critical sections and synchronization
    - Effectively linearize parallel processing
- Commonly shared objects involving synchronization
  - Queues
  - Stacks
4 Remote Memory Requests (RMRs)
- An access to data shared between all processes that requires communication with shared memory
- The data is not readily available in the local process cache
- Generated by any update or access to a shared variable that is not present in the core's cache
5 Traditional SAS Synchronization Techniques
- Remote Core Locking
  - A dedicated process (server) executes critical sections
- Combiner Approach
  - Every process adopts critical-section execution duty when needed
- Wait-Free Objects
  - Modify the shared object locally and commit only if no other thread has modified it
6 Remote Core Locking
- Transfer execution of a critical section (CS) to the server
- The server is the only entity that can modify variables in the critical section
- Improves data locality
- Reduces cache misses
- Reduces hot spots
7 RCL Request List [2]
- Data sent to the server
  - Address of the lock
  - Variables (encapsulated in a structure)
  - Address of the CS function
- Stored in a cache line for SAS
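The request layout on the slide above (lock, context structure, CS function) can be sketched as a small Python model. This is only an illustration under stated assumptions, not the RCL library from [2]: the names `Slot`, `RclServer`, `execute` and `rcl_demo` are hypothetical, a Python object stands in for each client's cache line, and the sketch assumes a single lock, so `lock_id` is carried in the request but never checked.

```python
import threading
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class Slot:
    """One client's request slot (stands in for a dedicated cache line)."""
    lock_id: Optional[int] = None        # which lock the CS belongs to
    context: Any = None                  # variables, packed in one structure
    func: Optional[Callable] = None      # the CS function; None = slot empty
    result: Any = None
    done: threading.Event = field(default_factory=threading.Event)


class RclServer:
    """Dedicated thread that executes every critical section itself."""

    def __init__(self, n_clients: int):
        self.slots = [Slot() for _ in range(n_clients)]
        self._running = True
        self._thread = threading.Thread(target=self._serve, daemon=True)
        self._thread.start()

    def _serve(self):
        while self._running:
            for slot in self.slots:              # poll every client slot
                func = slot.func
                if func is not None:
                    slot.result = func(slot.context)  # run the CS on the server
                    slot.func = None                  # mark the slot empty
                    slot.done.set()                   # release the client
            time.sleep(0)  # yield to other threads; a real server keeps spinning

    def execute(self, client_id, lock_id, func, context):
        slot = self.slots[client_id]
        slot.lock_id, slot.context = lock_id, context
        slot.func = func          # written last: this publishes the request
        slot.done.wait()          # wait until the server has run the CS
        slot.done.clear()
        return slot.result

    def stop(self):
        self._running = False
        self._thread.join()


def rcl_demo(n_clients=4, ops=25):
    shared = {"counter": 0}       # only ever modified by the server thread

    def increment(ctx):           # the critical section
        ctx["counter"] += 1
        return ctx["counter"]

    server = RclServer(n_clients)

    def client(cid):
        for _ in range(ops):
            server.execute(cid, 0, increment, shared)

    workers = [threading.Thread(target=client, args=(i,)) for i in range(n_clients)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    server.stop()
    return shared["counter"]
```

Because every increment runs on the server thread, the counter needs no lock of its own; the clients only ever write to their own slot.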
8 Remote Core Locking
9 Combiner Approach
- No dedicated server to handle CS requests
- A head node, or combiner, services requests
- When a CS is reached, the task is submitted to the combiner
- If there is no combiner, the process submitting the task becomes the combiner
- The combiner runs until the request queue is empty or h requests have been serviced
- SAS implementations
  - CC-Synch
  - H-Synch
- Message passing implementations
  - HybComb
10 CC-Synch
- Cache-Coherent Synch
- List of CS requests maintained in shared memory
- Atomic operations are used to access the CS request list
- When a process reaches a critical section, it
  - Adds the request to the list
  - Tries to acquire a global lock
- If successful
  - The thread is now the head node, or combiner
  - It services up to h requests
  - Then it returns
- If not successful
  - It spins until it becomes the head node or its CS has been completed
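The flow described on the slide above (publish the request, try to become the combiner, serve up to h requests, otherwise wait) can be sketched as follows. This is a simplified model that follows the slide's wording, not the published CC-Synch algorithm: details such as how the request list is linked and how the combiner role is handed over are not modeled, and the names `CombiningLock`, `execute` and `combining_demo` are hypothetical.

```python
import threading
from collections import deque


class CombiningLock:
    def __init__(self, h=8):
        self.h = h                            # max requests served per combiner turn
        self.requests = deque()               # shared request list
        self.combiner_lock = threading.Lock() # whoever holds it is the combiner

    def execute(self, func, arg):
        req = {"func": func, "arg": arg, "result": None,
               "done": threading.Event()}
        self.requests.append(req)             # add the request to the list
        while not req["done"].is_set():
            if self.combiner_lock.acquire(blocking=False):
                # This thread is now the combiner.
                try:
                    for _ in range(self.h):
                        try:
                            pending = self.requests.popleft()
                        except IndexError:    # request list is empty
                            break
                        pending["result"] = pending["func"](pending["arg"])
                        pending["done"].set()
                finally:
                    self.combiner_lock.release()
            else:
                # Someone else is combining: wait briefly, then re-check.
                req["done"].wait(timeout=0.001)
        return req["result"]


def combining_demo(n_threads=4, ops=50, h=3):
    lock = CombiningLock(h)
    state = {"counter": 0}

    def add(x):                               # the critical section
        state["counter"] += x
        return state["counter"]

    def worker():
        for _ in range(ops):
            lock.execute(add, 1)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["counter"]
```

Only the current holder of `combiner_lock` ever runs a critical section, so the counter stays consistent even though most threads never execute their own request.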
11 H-Synch
- Hierarchical Synch
- Implementation for clusters
  - m processors, c clusters
  - m/c processors per cluster
  - Communication is slower across clusters
- CC-Synch runs locally on each cluster
- CS synchronization across clusters
  - Before the local combiner executes requests, it acquires lock L to ensure other clusters do not access the CS
  - It releases L after servicing the requests
12 Wait-Free Construction
- Goal: eliminate serialization delay in shared objects
- Use atomic primitives on shared objects that execute in a fixed number of cycles
  - CAS
  - LL/SC
  - Fetch&Add
- Combinations of the three primitives are often used for synchronization
- Use the wait-free primitives for the combiner identity
13 Wait-Free Construction
- CAS: Compare and Swap
  - Supports two operations
    1. read(O): return the current value of O
    2. CAS(O, u, v): compare O to u; if equal, set O to v and return true; if not equal, return false
- LL/SC: Load-Linked / Store-Conditional
  - Supports two operations
    1. LL(O): return the current value of O
    2. SC(O, v) by process p: O takes the value v if and only if no other process has performed a successful SC on O since p's last LL(O); return true if successful, otherwise false
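The two primitives above can be modeled in Python to make their semantics concrete. This is a sketch only: real CAS and LL/SC are single hardware instructions, whereas here an internal lock merely imitates their atomicity, and the class and function names (`CasRegister`, `LlScRegister`, `cas_increment`) are hypothetical.

```python
import threading


class CasRegister:
    """Shared object O with read(O) and CAS(O, u, v)."""

    def __init__(self, value=0):
        self._value = value
        self._guard = threading.Lock()   # models hardware atomicity only

    def read(self):
        with self._guard:
            return self._value

    def cas(self, expected, new):
        with self._guard:
            if self._value == expected:
                self._value = new
                return True
            return False


class LlScRegister:
    """Shared object O with LL(O) and SC(O, v)."""

    def __init__(self, value=0):
        self._value = value
        self._version = 0                # bumped by every successful SC
        self._links = {}                 # thread id -> version seen by its last LL
        self._guard = threading.Lock()

    def ll(self):
        with self._guard:
            self._links[threading.get_ident()] = self._version
            return self._value

    def sc(self, new):
        with self._guard:
            linked = self._links.pop(threading.get_ident(), None)
            if linked != self._version:  # no LL, or someone's SC succeeded since
                return False
            self._value = new
            self._version += 1
            return True


def cas_increment(reg):
    """Increment built from CAS; returns the value that was replaced.

    Note: this retry loop is lock-free rather than wait-free, because an
    individual thread may have to retry an unbounded number of times.
    """
    while True:
        old = reg.read()
        if reg.cas(old, old + 1):
            return old
```

A failed `cas` or `sc` tells the caller that another process changed O in the meantime, which is what lets a thread "commit only if no other thread has modified it".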
14 Wait-Free Construction
- Fetch&Add
  - Supports two operations
    1. read(O)
    2. Fetch&Add(O, x): O += x; return the previous value of O
  - One memory access
  - Minimizes serialization delays
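As an illustration of Fetch&Add, the sketch below uses it as a ticket dispenser: every caller obtains a distinct number with one operation and no retry loop. As with any Python model of a hardware primitive, the internal lock only imitates atomicity; `FetchAddRegister` and `take_tickets` are hypothetical names.

```python
import threading


class FetchAddRegister:
    """Shared object O with read(O) and Fetch&Add(O, x)."""

    def __init__(self, value=0):
        self._value = value
        self._guard = threading.Lock()   # models hardware atomicity only

    def read(self):
        with self._guard:
            return self._value

    def fetch_and_add(self, x):
        with self._guard:
            old = self._value
            self._value += x             # O += x
            return old                   # return the previous value of O


def take_tickets(n_threads=4, per_thread=50):
    dispenser = FetchAddRegister(0)
    taken = []
    taken_guard = threading.Lock()       # protects the result list only

    def worker():
        mine = [dispenser.fetch_and_add(1) for _ in range(per_thread)]
        with taken_guard:
            taken.extend(mine)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(taken), dispenser.read()
```

Unlike a CAS retry loop, each `fetch_and_add` call completes in a single step regardless of contention, which is the property the slide refers to as minimizing serialization delays.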
15 Shortcomings of Traditional SAS
- Cache coherence (CC) is difficult to scale
- Performance is tied to the cache coherency model
  - RCL or a combiner on SAS still has to synchronize the request queue in cache
  - Short CSs are dominated by cache-coherency stalls
- Optimization requires in-depth understanding of the hardware
  - Most programmers do not understand the underlying hardware architecture
  - HW vendors hide hardware information
16 Thread Synchronization Using Message Passing
- Message passing allows explicit control over communication
- Message passing only
  - MP-Server
- Hybrid SAS and message passing
  - HybComb
  - Message passing is better suited for sending tasks to the combiner
  - SAS is better suited for maintaining the identity of the combiner
  - Meant for a platform with HW support for both SAS and message passing
17 MP-Server
- Essentially RCL using message passing
- The server reads requests from its local message queue
  - No remote memory requests (RMRs) involved
- When sending the response for a finished CS request, the server does not wait for the transmission to complete
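The MP-Server idea can be sketched with Python queues standing in for hardware message channels. This is a minimal model under stated assumptions, not the implementation evaluated in [1]: `queue.Queue` is a software queue rather than a hardware message buffer, and the names `mp_server` and `mp_server_demo` are hypothetical.

```python
import queue
import threading


def mp_server(inbox, replies, state):
    """Server loop: read a request message, run the CS, send the reply."""
    while True:
        msg = inbox.get()                # read from the local message queue
        if msg is None:                  # shutdown message (sketch only)
            break
        client_id, func, arg = msg
        # put() on an unbounded queue returns immediately, so the server
        # moves on without waiting for the client to pick up the reply.
        replies[client_id].put(func(state, arg))


def mp_server_demo(n_clients=4, ops=50):
    inbox = queue.Queue()                               # clients -> server
    replies = [queue.Queue() for _ in range(n_clients)] # server -> each client
    state = {"counter": 0}               # touched by the server thread only

    def increment(st, amount):           # the critical section
        st["counter"] += amount
        return st["counter"]

    server = threading.Thread(target=mp_server, args=(inbox, replies, state))
    server.start()
    seen = [[] for _ in range(n_clients)]

    def client(cid):
        for _ in range(ops):
            inbox.put((cid, increment, 1))       # send the CS request
            seen[cid].append(replies[cid].get()) # wait for the response

    clients = [threading.Thread(target=client, args=(i,)) for i in range(n_clients)]
    for c in clients:
        c.start()
    for c in clients:
        c.join()
    inbox.put(None)
    server.join()
    return state["counter"], seen
```

The shared state is never read or written by a client; all coordination happens through the two message directions, which is the property that removes RMRs from the server's path.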
18 HybComb
- HW message passing synchronizes the combiner and the other processes
- SAS and the cache store the identity of the combiner
- When a process reaches a CS
  1. Check shared memory for the identity of the combiner
  2. Send the task to the combiner
     - If no combiner is present, become the combiner
     - Write the process identity to shared memory using CAS(O, u, v)
  3. Wait for completion
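The three steps above can be sketched by combining a CAS-protected combiner identity with per-process message queues. This is a rough model, not the HybComb algorithm from [1]. In particular, the per-combiner ticket counter is an assumption added here so that no request is ever sent to a combiner that has already stepped down; the slide does not describe how that case is handled. All names (`AtomicCell`, `HybComb`, `CLOSED`, `hybcomb_demo`) are hypothetical, and locks and `queue.Queue` only imitate hardware atomics and hardware message passing.

```python
import queue
import threading
import time


class AtomicCell:
    """Lock-backed model of a shared word with CAS, swap and fetch&add."""

    def __init__(self, value):
        self._value = value
        self._guard = threading.Lock()

    def read(self):
        with self._guard:
            return self._value

    def cas(self, expected, new):
        with self._guard:
            if self._value == expected:
                self._value = new
                return True
            return False

    def swap(self, new):
        with self._guard:
            old, self._value = self._value, new
            return old

    def fetch_and_add(self, x):
        with self._guard:
            old = self._value
            self._value += x
            return old


CLOSED = 1 << 30   # ticket value meaning "not accepting requests"


class HybComb:
    def __init__(self, n_procs, h=4):
        self.h = h
        self.combiner = AtomicCell(None)   # shared memory: combiner identity
        self.tickets = [AtomicCell(CLOSED) for _ in range(n_procs)]  # assumption
        self.inbox = [queue.Queue() for _ in range(n_procs)]   # message channels
        self.reply = [queue.Queue() for _ in range(n_procs)]

    def execute(self, me, func, arg):
        while True:
            c = self.combiner.read()                   # step 1: who combines?
            if c is None:
                if self.combiner.cas(None, me):        # become the combiner
                    return self._combine(me, func, arg)
            elif self.tickets[c].fetch_and_add(1) < self.h:
                self.inbox[c].put((me, func, arg))     # step 2: send the task
                return self.reply[me].get()            # step 3: wait
            time.sleep(0)   # combiner changed, is full, or is closing: retry

    def _combine(self, me, func, arg):
        self.tickets[me].swap(0)                       # open for up to h requests
        result = func(arg)                             # own critical section
        served = 0
        while served < self.h:                         # serve what has arrived
            try:
                sender, f, a = self.inbox[me].get_nowait()
            except queue.Empty:
                break
            self.reply[sender].put(f(a))
            served += 1
        registered = min(self.tickets[me].swap(CLOSED), self.h)  # close
        while served < registered:                     # requests still in flight
            sender, f, a = self.inbox[me].get()
            self.reply[sender].put(f(a))
            served += 1
        self.combiner.cas(me, None)                    # step down
        return result


def hybcomb_demo(n_procs=4, ops=25, h=2):
    hc = HybComb(n_procs, h)
    state = {"counter": 0}

    def add(x):                                        # the critical section
        state["counter"] += x
        return state["counter"]

    def proc(pid):
        for _ in range(ops):
            hc.execute(pid, add, 1)

    threads = [threading.Thread(target=proc, args=(i,)) for i in range(n_procs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["counter"]
```

Shared memory is touched only for the small combiner-identity and ticket words, while the requests and responses themselves travel as messages, which mirrors the division of labor the slides describe for the hybrid approach.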
19 Performance: Concurrent Counter [1]
- Message passing implementations perform better
- MP-Server has the best throughput and latency
- Little difference between HybComb and the SAS implementations (RCL and CC-Synch) until 15+ application threads
20 Performance [1]
- Much lower percentage of stalls with message passing
- Shorter critical section length
21 Performance: Stacks and Queues [1]
- Message passing still offers a performance advantage with queues and stacks
- Best performance is achieved by limiting the number of locks in the queue
22 Conclusion
- Message passing is a better alternative for thread synchronization
- MP-Server is the best candidate
  - Outperforms HybComb in every characteristic
  - Much lower algorithmic complexity
  - Only message passing HW needed
23 References
[1] D. Petrović, T. Ropars and A. Schiper, "Leveraging hardware message passing for efficient thread synchronization", ACM SIGPLAN Notices, vol. 49, no. 8.
[2] J. Lozi, F. David, G. Thomas, J. Lawall and G. Muller, "Remote Core Locking: Migrating Critical-Section Execution to Improve the Performance of Multithreaded Applications", Proceedings of the 2012 USENIX Annual Technical Conference.
[3] P. Fatourou and N. Kallimanis, "Revisiting the combining synchronization technique", ACM SIGPLAN Notices, vol. 47, no. 8, p. 257.
[4] P. Fatourou and N. Kallimanis, "Highly-Efficient Wait-Free Synchronization", Theory of Computing Systems, vol. 55, no. 3, 2013.