ffwd: delegation is (much) faster than you think Sepideh Roghanchi, Jakob Eriksson, Nilanjana Basu
|
|
- Grant McDowell
- 5 years ago
- Views:
Transcription
1 ffwd: delegation is (much) faster than you think Sepideh Roghanchi, Jakob Eriksson, Nilanjana Basu
2 int get_seqno() { } return ++seqno; // ~1 Billion ops/s // single-threaded
3 int threadsafe_get_seqno() { acquire(lock); int ret=++seqno; release(lock); return ret; } // < 10 Million ops/s
4 10 MCS MUTEX TTAS CLH TAS 8 Throughput (Mops) Hardware threads
5 why so slow?
6
7
8 ~70 ns intra-socket latency ~14 Mops
9 QPI QPI Quad-Socket Intel System QPI M cachelines/s gigabyte/s QPI ~200ns O/W latency = 5 million ops per second
10 THREAD 1 THREAD 2 critical section interconnect latency wait for lock 400x wait for lock interconnect latency critical section wait for lock critical section
11 THREAD 1 THREAD 2 THREAD3 critical section wait for lock critical section wait for lock 600x wait for lock wait for lock critical section wait for lock critical section
12 client dedicated server thread client client client
13 CLIENT1 DEDICATED SERVER THREAD request wait for request 400x wait for critical section wait for request wait for critical section
14 CLIENT1 DEDICATED SERVER THREAD CLIENTn request wait for request request 400x still! wait for critical section critical section wait for wait for request wait for critical section wait for critical section
15 CLIENT1 DEDICATED SERVER THREAD CLIENTn wait for request 400x still! wait for critical section critical section critical section critical section critical section critical section wait for critical request section wait for wait for wait for wait wait for for wait for wait for wait for critical section critical section critical section critical section wait for wait for wait for wait for wait for
16 (fast, fly-weight delegation) ffwd design write all s, thread group N SERVER read & act on all thread group 0 requests READ client request one line per thread CLIENTS WRITE write request to server client request for each of N thread groups read & act on all thread group 1 requests write all s, thread group 0 WRITE shared server one line per group of 15 threads READ spin on server shared server (128 bytes) client request (64 bytes) return values[15] toggle bits toggle bit function pointer arg count argv[6] Server acts upon pending requests in batches 15 clients. Each group of 15 clients shares one 128-byte line pair. One dedicated 64-byte request line, per client-server pair Requests are sent synchronously
17 A request in more detail client request (64 bytes) toggle bit function pointer arg count argv[6] shared server (128 bytes) return values[15] toggle bits request is new if: request toggle bit!= toggle bit server calls function with (64-bit) arguments provided client polls line until toggle bit == bit
18 delegation server thread group 0 requests local buffer..111
19 delegation server thread group 0 requests local buffer..110
20 delegation server thread group 0 requests local buffer..100
21 delegation server thread group 0 requests local buffer..100
22 delegation server thread group 0 requests local buffer..100 global buffer modified < cache lines > shared
23 group 0 requests local buffer..100 global buffer..100 modified < cache lines > shared
24 group shared 0 requests < -requests- > modified local buffer..100 global buffer..100 modified < cache lines > shared
25 shared < -requests- > modified local buffer..100 global buffer..100 modified < cache lines > shared
26 shared < -requests- > modified local buffer..100 global buffer..100 modified < cache lines > shared
27 shared < -requests- > modified local buffer..100 global buffer..100 modified < cache lines > shared
28 performance evaluation
29 evaluation systems 4 16-core Xeon E5-4660, Broadwell, 2.2 GHz 4 8-core Xeon E5-4620, Sandy Bridge-EP, 2.2 GHz 4 8-core Xeon E7-4820, Westmere-EX, 2.0 GHz 4 8-core AMD Opteron 6378, Abu Dhabi, 2.4 GHz
30 application benchmarks Same benchmarks as in Lozi et al. (RCL) [USENIX ATC 12] programs that spend large % of time in critical sections Except BerkeleyDB - ran out of time
31 raytrace-car (SPLASH-2) FFWD MCS MUTEX TAS FC RCL Duration (ms) mutex ffwd 0 MCS RCL # of threads RCL experienced correctness issues above 82 threads.
32 application benchmarks comparing best performance (any thread count) for all methods up to 2.5x improvement over pthreads, any thread count 10+ times speedup at max thread count
33 memcached-set 300 FFWD MCS MUTEX TAS RCL Duration (sec) pthread mutex # of threads RCL experienced correctness issues above 24 threads. We did not get Flat Combining to work.
34 microbenchmarks
35 ffwd is much faster on largely sequential data structures linked list (coarse locking), stack, queue fetch and add, for few shared variables for highly concurrent data structures, ffwd falls behind when the lock contention is low fetch and add, with many shared variables hashtable for concurrent data structures with long query times, ffwd keeps up, but is not a clear leader lazy linked list binary search tree
36 naïve 1024-node linked-list, coarse-grained locking FFWD MCS MUTEX TTAS TICKET CLH TAS HTICKET STM Throughput (Mops) Hardware threads
37 two-lock queue FFWD MCS MUTEX TTAS TICKET CLH HTICKET FC RCL CC DSM H MS SIM BLF Throughput (Mops) Hardware threads
38 stack FFWD MCS MUTEX TTAS TICKET CLH HTICKET FC RCL CC DSM H LF SIM BLF Throughput (Mops) Hardware threads
39 fetch-and-add, 1 variable FFWD MCS MUTEX TTAS TICKET CLH TAS HTICKET FC RCL ATOMIC Throughput (Mops) hardware-provided atomic increment instruction! Hardware threads
40 ffwd is much faster on largely sequential data structures naïve linked list, stack, queue fetch and add, for few shared variables for highly concurrent data structures, fwd falls behind when there are many locks fetch and add, with many shared variables hashtable for concurrent data structures with long query times, ffwd keeps up, but is not a clear leader lazy linked list binary search tree
41 128-thread hash table FFWD MCS MUTEX TTAS TICKET CLH TAS HTICKET 175 Throughput (Mops) #buckets locking takes the lead when #locks ~ #threads
42 fetch-and-add, 128 threads FFWD MCS MUTEX TTAS TICKET CLH TAS HTICKET FC RCL 300 Throughput (Mops) # of shared variables (locks)
43 fetch-and-add, 128 threads FFWD MCS MUTEX TTAS TICKET CLH TAS HTICKET FC RCL ATOMIC 300 Throughput (Mops) atomic increment > # of shared variables (locks)
44 ffwd is much faster on largely sequential data structures naïve linked list, stack, queue fetch and add, for few shared variables for highly concurrent data structures, once lock# is similar to thread#, fwd falls behind fetch and add, with many shared variables hashtable for concurrent data structures with long query times, ffwd keeps up, but is not a clear leader lazy linked list binary search tree
45 128-thread lazy concurrent lists FFWD-LZ MCS-LZ MUTEX-LZ TTAS-LZ TICKET-LZ CLH-LZ TAS-LZ HTICKET-LZ HARRIS FC-LZ RCL-LZ Throughput (Mops) #elements
46 128-thread lazy (LZ) + skip (SK) lists FFWD-LZ FFWD-SK MCS-LZ MCS-SK MUTEX-LZ TTAS-LZ TICKET-LZ CLH-LZ TAS-LZ HTICKET-LZ HARRIS FC-LZ RCL-LZ Throughput (Mops) high concurrency, not very long list lower skip list complexity saves the day #elements single MCS lock skip list
47 binary search tree simple, unbalanced tree 50% queries, 50% updates all tree operations delegated for ffwd/rcl
48 128-thread binary search tree 50 FFWD RCL RCU RLU SWISSTM VRBTREE VRTREE Single threaded Throughput (Mops/s) ffwd is bounded by single-threaded performance K 8K 32K 128K Initial Size
49 128-thread tree + 4-way sharding ffwd 50 FFWD FFWD-S4 RCL RCU RLU SWISSTM VRBTREE VRTREE Single threaded Throughput (Mops) K 8K 32K 128K Tree size
50 what makes ffwd so fast? write all s, thread group N SERVER read & act on all thread group 0 requests READ client request one line per thread CLIENTS WRITE write request to server client request for each of N thread groups read & act on all thread group 1 requests write all s, thread group 0 WRITE shared server one line per group of 15 threads READ spin on server shared server (128 bytes) requests are virtually un-contended, toggle return values[15] bits contiguous in memory = happy server hardware pre-fetcher buffered s on server 15 s in one contiguous copy client request (64 bytes) 2 modified cache-lines instead of 15 toggle bit function pointer arg count argv[6] s are read-only on the client line never leaves the server L1 very light-weight processing on the server plenty of hand-tuning
51 Why isn t it even faster? Link bandwidth is 300 cache lines per link > 300+ Mops Latency suggests 2.5 Mops/client. 120 clients > 300 Mops Why are we only seeing 55 Mops? Processing limit? 55 Mops = 40 cycles per operation Insufficient concurrency: round-trip bandwidth-delay product is 120 cache lines server store / load buffers, reorder window size?
52 using ffwd Free C library available now (Rust is on the way) Some current limitations: delegated functions cannot, in turn, delegate functions delegated functions typically should not block (nor acquire locks) up to 6, 64-bit parameters currently assume one client per hardware thread
53 related work Remote Core Locking [Lozi, USENIX ATC 12] Barrelfish - delegation-based OS [Baumann, SOSP 09] Flat Combining [Hendler, SPAA 10] Log-based node replication [Calciu, ASPLOS 17]
54 in conclusion delegation is (much) faster than you thought it is easy to use, and has many attractive applications similar results on Intel Broadwell Intel Sandy Bridge Intel Westmere-EX, and AMD Abu Dhabi
55 questions? UIC has many open CS faculty positions this year, all areas libffwd, extended paper and more
56 comparing parallelism locking delegation critical section #locks #servers communication latency #locks #clients
Remote Core Locking. Migrating Critical-Section Execution to Improve the Performance of Multithreaded Applications. to appear at USENIX ATC 12
Remote Core Locking Migrating Critical-Section Execution to Improve the Performance of Multithreaded Applications to appear at USENIX ATC 12 Jean-Pierre Lozi LIP6/INRIA Florian David LIP6/INRIA Gaël Thomas
More informationOptimization of thread affinity and memory affinity for remote core locking synchronization in multithreaded programs for multicore computer systems
Optimization of thread affinity and memory affinity for remote core locking synchronization in multithreaded programs for multicore computer systems Alexey Paznikov Saint Petersburg Electrotechnical University
More informationSynchronization. Erik Hagersten Uppsala University Sweden. Components of a Synchronization Even. Need to introduce synchronization.
Synchronization sum := thread_create Execution on a sequentially consistent shared-memory machine: Erik Hagersten Uppsala University Sweden while (sum < threshold) sum := sum while + (sum < threshold)
More informationMessage Passing Improvements to Shared Address Space Thread Synchronization Techniques DAN STAFFORD, ROBERT RELYEA
Message Passing Improvements to Shared Address Space Thread Synchronization Techniques DAN STAFFORD, ROBERT RELYEA Agenda Background Motivation Remote Memory Request Shared Address Synchronization Remote
More informationAdvance Operating Systems (CS202) Locks Discussion
Advance Operating Systems (CS202) Locks Discussion Threads Locks Spin Locks Array-based Locks MCS Locks Sequential Locks Road Map Threads Global variables and static objects are shared Stored in the static
More informationEverything You Always Wanted to Know About Synchronization but Were Afraid to Ask
Everything You Always Wanted to Know About Synchronization but Were Afraid to Ask Tudor David, Rachid Guerraoui and Vasileios Trigonakis Ecole Polytechnique Federale de Lausanne(EPFL) Haksu Lim, Luis,
More informationSynchronization. CS61, Lecture 18. Prof. Stephen Chong November 3, 2011
Synchronization CS61, Lecture 18 Prof. Stephen Chong November 3, 2011 Announcements Assignment 5 Tell us your group by Sunday Nov 6 Due Thursday Nov 17 Talks of interest in next two days Towards Predictable,
More informationPerformance Improvement via Always-Abort HTM
1 Performance Improvement via Always-Abort HTM Joseph Izraelevitz* Lingxiang Xiang Michael L. Scott* *Department of Computer Science University of Rochester {jhi1,scott}@cs.rochester.edu Parallel Computing
More informationMemC3: MemCache with CLOCK and Concurrent Cuckoo Hashing
MemC3: MemCache with CLOCK and Concurrent Cuckoo Hashing Bin Fan (CMU), Dave Andersen (CMU), Michael Kaminsky (Intel Labs) NSDI 2013 http://www.pdl.cmu.edu/ 1 Goal: Improve Memcached 1. Reduce space overhead
More informationNON-BLOCKING DATA STRUCTURES AND TRANSACTIONAL MEMORY. Tim Harris, 3 Nov 2017
NON-BLOCKING DATA STRUCTURES AND TRANSACTIONAL MEMORY Tim Harris, 3 Nov 2017 Lecture 1/3 Introduction Basic spin-locks Queue-based locks Hierarchical locks Reader-writer locks Reading without locking Flat
More informationIsoStack Highly Efficient Network Processing on Dedicated Cores
IsoStack Highly Efficient Network Processing on Dedicated Cores Leah Shalev Eran Borovik, Julian Satran, Muli Ben-Yehuda Outline Motivation IsoStack architecture Prototype TCP/IP over 10GE on a single
More information1754 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 28, NO. 6, JUNE Scalable Adaptive NUMA-Aware Lock
1754 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 28, NO. 6, JUNE 2017 Scalable Adaptive NUMA-Aware Lock Mingzhe Zhang, Haibo Chen, Senior Member, IEEE, Luwei Cheng, Francis C. M. Lau, Senior
More informationMultiprocessor OS. COMP9242 Advanced Operating Systems S2/2011 Week 7: Multiprocessors Part 2. Scalability of Multiprocessor OS.
Multiprocessor OS COMP9242 Advanced Operating Systems S2/2011 Week 7: Multiprocessors Part 2 2 Key design challenges: Correctness of (shared) data structures Scalability Scalability of Multiprocessor OS
More informationCSE 451: Operating Systems Winter Lecture 7 Synchronization. Steve Gribble. Synchronization. Threads cooperate in multithreaded programs
CSE 451: Operating Systems Winter 2005 Lecture 7 Synchronization Steve Gribble Synchronization Threads cooperate in multithreaded programs to share resources, access shared data structures e.g., threads
More informationCS 333 Introduction to Operating Systems. Class 3 Threads & Concurrency. Jonathan Walpole Computer Science Portland State University
CS 333 Introduction to Operating Systems Class 3 Threads & Concurrency Jonathan Walpole Computer Science Portland State University 1 The Process Concept 2 The Process Concept Process a program in execution
More informationHigh-Performance Transaction Processing in Journaling File Systems Y. Son, S. Kim, H. Y. Yeom, and H. Han
High-Performance Transaction Processing in Journaling File Systems Y. Son, S. Kim, H. Y. Yeom, and H. Han Seoul National University, Korea Dongduk Women s University, Korea Contents Motivation and Background
More informationUtilizing the IOMMU scalably
Utilizing the IOMMU scalably Omer Peleg, Adam Morrison, Benjamin Serebrin, and Dan Tsafrir USENIX ATC 15 2017711456 Shin Seok Ha 1 Introduction What is an IOMMU? Provides the translation between IO addresses
More informationMULTIPROCESSOR OS. Overview. COMP9242 Advanced Operating Systems S2/2013 Week 11: Multiprocessor OS. Multiprocessor OS
Overview COMP9242 Advanced Operating Systems S2/2013 Week 11: Multiprocessor OS Multiprocessor OS Scalability Multiprocessor Hardware Contemporary systems Experimental and Future systems OS design for
More informationAdvanced Multiprocessor Programming Project Topics and Requirements
Advanced Multiprocessor Programming Project Topics and Requirements Jesper Larsson Trä TU Wien May 5th, 2017 J. L. Trä AMP SS17, Projects 1 / 21 Projects Goal: Get practical, own experience with concurrent
More informationHigh-Performance Key-Value Store on OpenSHMEM
High-Performance Key-Value Store on OpenSHMEM Huansong Fu*, Manjunath Gorentla Venkata, Ahana Roy Choudhury*, Neena Imam, Weikuan Yu* *Florida State University Oak Ridge National Laboratory Outline Background
More informationConcurrent Data Structures Concurrent Algorithms 2016
Concurrent Data Structures Concurrent Algorithms 2016 Tudor David (based on slides by Vasileios Trigonakis) Tudor David 11.2016 1 Data Structures (DSs) Constructs for efficiently storing and retrieving
More informationFlat Parallelization. V. Aksenov, ITMO University P. Kuznetsov, ParisTech. July 4, / 53
Flat Parallelization V. Aksenov, ITMO University P. Kuznetsov, ParisTech July 4, 2017 1 / 53 Outline Flat-combining PRAM and Flat parallelization PRAM binary heap with Flat parallelization ExtractMin Insert
More informationFlexible Architecture Research Machine (FARM)
Flexible Architecture Research Machine (FARM) RAMP Retreat June 25, 2009 Jared Casper, Tayo Oguntebi, Sungpack Hong, Nathan Bronson Christos Kozyrakis, Kunle Olukotun Motivation Why CPUs + FPGAs make sense
More informationFast and Portable Locking for Multicore Architectures
Fast and Portable Locking for Multicore Architectures Jean-Pierre Lozi, Florian David, Gaël Thomas, Julia Lawall, Gilles Muller To cite this version: Jean-Pierre Lozi, Florian David, Gaël Thomas, Julia
More informationCSE 451: Operating Systems Winter Lecture 7 Synchronization. Hank Levy 412 Sieg Hall
CSE 451: Operating Systems Winter 2003 Lecture 7 Synchronization Hank Levy Levy@cs.washington.edu 412 Sieg Hall Synchronization Threads cooperate in multithreaded programs to share resources, access shared
More informationLock Oscillation: Boosting the Performance of Concurrent Data Structures
Lock Oscillation: Boosting the Performance of Concurrent Data Structures Panagiota Fatourou FORTH ICS & University of Crete Nikolaos D. Kallimanis FORTH ICS The Multicore Era The dominance of Multicore
More informationDesigning High Performance Communication Middleware with Emerging Multi-core Architectures
Designing High Performance Communication Middleware with Emerging Multi-core Architectures Dhabaleswar K. (DK) Panda Department of Computer Science and Engg. The Ohio State University E-mail: panda@cse.ohio-state.edu
More informationComputer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors
Computer and Information Sciences College / Computer Science Department CS 207 D Computer Architecture Lecture 9: Multiprocessors Challenges of Parallel Processing First challenge is % of program inherently
More informationUnderstanding Hardware Transactional Memory
Understanding Hardware Transactional Memory Gil Tene, CTO & co-founder, Azul Systems @giltene 2015 Azul Systems, Inc. Agenda Brief introduction What is Hardware Transactional Memory (HTM)? Cache coherence
More informationSemaphore. Originally called P() and V() wait (S) { while S <= 0 ; // no-op S--; } signal (S) { S++; }
Semaphore Semaphore S integer variable Two standard operations modify S: wait() and signal() Originally called P() and V() Can only be accessed via two indivisible (atomic) operations wait (S) { while
More informationCS 333 Introduction to Operating Systems. Class 3 Threads & Concurrency. Jonathan Walpole Computer Science Portland State University
CS 333 Introduction to Operating Systems Class 3 Threads & Concurrency Jonathan Walpole Computer Science Portland State University 1 Process creation in UNIX All processes have a unique process id getpid(),
More informationLecture 10: Multi-Object Synchronization
CS 422/522 Design & Implementation of Operating Systems Lecture 10: Multi-Object Synchronization Zhong Shao Dept. of Computer Science Yale University Acknowledgement: some slides are taken from previous
More informationData Processing on Modern Hardware
Data Processing on Modern Hardware Jens Teubner, TU Dortmund, DBIS Group jens.teubner@cs.tu-dortmund.de Summer 2014 c Jens Teubner Data Processing on Modern Hardware Summer 2014 1 Part V Execution on Multiple
More informationNON-BLOCKING DATA STRUCTURES AND TRANSACTIONAL MEMORY. Tim Harris, 14 November 2014
NON-BLOCKING DATA STRUCTURES AND TRANSACTIONAL MEMORY Tim Harris, 14 November 2014 Lecture 6 Introduction Amdahl s law Basic spin-locks Queue-based locks Hierarchical locks Reader-writer locks Reading
More informationAdaptive Lock. Madhav Iyengar < >, Nathaniel Jeffries < >
Adaptive Lock Madhav Iyengar < miyengar@andrew.cmu.edu >, Nathaniel Jeffries < njeffrie@andrew.cmu.edu > ABSTRACT Busy wait synchronization, the spinlock, is the primitive at the core of all other synchronization
More informationModern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design
Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant
More informationDiffusion TM 5.0 Performance Benchmarks
Diffusion TM 5.0 Performance Benchmarks Contents Introduction 3 Benchmark Overview 3 Methodology 4 Results 5 Conclusion 7 Appendix A Environment 8 Diffusion TM 5.0 Performance Benchmarks 2 1 Introduction
More informationConcurrent programming: From theory to practice. Concurrent Algorithms 2015 Vasileios Trigonakis
oncurrent programming: From theory to practice oncurrent Algorithms 2015 Vasileios Trigonakis From theory to practice Theoretical (design) Practical (design) Practical (implementation) 2 From theory to
More informationRevisiting the Combining Synchronization Technique
Revisiting the Combining Synchronization Technique Panagiota Fatourou Department of Computer Science University of Crete & FORTH ICS faturu@csd.uoc.gr Nikolaos D. Kallimanis Department of Computer Science
More informationNew Oracle NoSQL Database APIs that Speed Insertion and Retrieval
New Oracle NoSQL Database APIs that Speed Insertion and Retrieval O R A C L E W H I T E P A P E R F E B R U A R Y 2 0 1 6 1 NEW ORACLE NoSQL DATABASE APIs that SPEED INSERTION AND RETRIEVAL Introduction
More informationIntel Architecture for Software Developers
Intel Architecture for Software Developers 1 Agenda Introduction Processor Architecture Basics Intel Architecture Intel Core and Intel Xeon Intel Atom Intel Xeon Phi Coprocessor Use Cases for Software
More informationCPHash: A Cache-Partitioned Hash Table Zviad Metreveli, Nickolai Zeldovich, and M. Frans Kaashoek
Computer Science and Artificial Intelligence Laboratory Technical Report MIT-CSAIL-TR-211-51 November 26, 211 : A Cache-Partitioned Hash Table Zviad Metreveli, Nickolai Zeldovich, and M. Frans Kaashoek
More informationMidterm Exam Amy Murphy 19 March 2003
University of Rochester Midterm Exam Amy Murphy 19 March 2003 Computer Systems (CSC2/456) Read before beginning: Please write clearly. Illegible answers cannot be graded. Be sure to identify all of your
More informationMultithreading Deep Dive. my twitter. a deep investigation on how.net multithreading primitives map to hardware and Windows Kernel.
Multithreading Deep Dive a deep investigation on how.net multithreading primitives map to hardware and Windows Kernel Gaël Fraiteur PostSharp Technologies Founder & Principal Engineer my twitter Objectives
More informationSpinlocks. Spinlocks. Message Systems, Inc. April 8, 2011
Spinlocks Samy Al Bahra Devon H. O Dell Message Systems, Inc. April 8, 2011 Introduction Mutexes A mutex is an object which implements acquire and relinquish operations such that the execution following
More informationOracle Database 12c: JMS Sharded Queues
Oracle Database 12c: JMS Sharded Queues For high performance, scalable Advanced Queuing ORACLE WHITE PAPER MARCH 2015 Table of Contents Introduction 2 Architecture 3 PERFORMANCE OF AQ-JMS QUEUES 4 PERFORMANCE
More informationThird Midterm Exam April 24, 2017 CS162 Operating Systems
University of California, Berkeley College of Engineering Computer Science Division EECS Spring 2017 Ion Stoica Third Midterm Exam April 24, 2017 CS162 Operating Systems Your Name: SID AND 162 Login: TA
More informationModern Processor Architectures. L25: Modern Compiler Design
Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions
More informationThe Multikernel: A new OS architecture for scalable multicore systems Baumann et al. Presentation: Mark Smith
The Multikernel: A new OS architecture for scalable multicore systems Baumann et al. Presentation: Mark Smith Review Introduction Optimizing the OS based on hardware Processor changes Shared Memory vs
More informationComputer Architecture
Jens Teubner Computer Architecture Summer 2016 1 Computer Architecture Jens Teubner, TU Dortmund jens.teubner@cs.tu-dortmund.de Summer 2016 Jens Teubner Computer Architecture Summer 2016 83 Part III Multi-Core
More informationDesigning High Performance DSM Systems using InfiniBand Features
Designing High Performance DSM Systems using InfiniBand Features Ranjit Noronha and Dhabaleswar K. Panda The Ohio State University NBC Outline Introduction Motivation Design and Implementation Results
More informationNUMA-Aware Reader-Writer Locks PPoPP 2013
NUMA-Aware Reader-Writer Locks PPoPP 2013 Irina Calciu Brown University Authors Irina Calciu @ Brown University Dave Dice Yossi Lev Victor Luchangco Virendra J. Marathe Nir Shavit @ MIT 2 Cores Chip (node)
More informationScalable Locking. Adam Belay
Scalable Locking Adam Belay Problem: Locks can ruin performance 12 finds/sec 9 6 Locking overhead dominates 3 0 0 6 12 18 24 30 36 42 48 Cores Problem: Locks can ruin performance the locks
More informationExploring Synchronization in Cache Coherent Manycore Systems: A Case Study with Xeon Phi
217 IEEE 23rd International Conference on Parallel and Distributed Systems Exploring Synchronization in Cache Coherent Manycore Systems: A Case Study with Xeon Phi Xin He, Zhiwen Chen, Jianhua Sun, Hao
More informationFlexNIC: Rethinking Network DMA
FlexNIC: Rethinking Network DMA Antoine Kaufmann Simon Peter Tom Anderson Arvind Krishnamurthy University of Washington HotOS 2015 Networks: Fast and Growing Faster 1 T 400 GbE Ethernet Bandwidth [bits/s]
More informationMicroarchitecture Overview. Performance
Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 18, 2005 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make
More informationMemC3: Compact and Concurrent MemCache with Dumber Caching and Smarter Hashing
To appear in Proceedings of the 1th USENIX Symposium on Networked Systems Design and Implementation (NSDI 13), Lombard, IL, April 213 MemC3: Compact and Concurrent MemCache with Dumber Caching and Smarter
More informationIncrementally Parallelizing. Twofold Speedup on a Quad-Core. Thread-Level Speculation. A Case Study with BerkeleyDB. What Am I Working on Now?
Incrementally Parallelizing Database Transactions with Thread-Level Speculation Todd C. Mowry Carnegie Mellon University (in collaboration with Chris Colohan, J. Gregory Steffan, and Anastasia Ailamaki)
More informationNUMA-aware Reader-Writer Locks. Tom Herold, Marco Lamina NUMA Seminar
04.02.2015 NUMA Seminar Agenda 1. Recap: Locking 2. in NUMA Systems 3. RW 4. Implementations 5. Hands On Why Locking? Parallel tasks access shared resources : Synchronization mechanism in concurrent environments
More informationAdvanced Topic: Efficient Synchronization
Advanced Topic: Efficient Synchronization Multi-Object Programs What happens when we try to synchronize across multiple objects in a large program? Each object with its own lock, condition variables Is
More informationSELF-TUNING HTM. Paolo Romano
SELF-TUNING HTM Paolo Romano 2 Based on ICAC 14 paper N. Diegues and Paolo Romano Self-Tuning Intel Transactional Synchronization Extensions 11 th USENIX International Conference on Autonomic Computing
More informationSDSM Progression. Implementing Shared Memory on Distributed Systems. Software Distributed Shared Memory. Why a Distributed Shared Memory (DSM) System?
SDSM Progression Implementing Shared Memory on Distributed Systems Sandhya Dwarkadas University of Rochester TreadMarks shared memory for networks of workstations Cashmere-2L - 2-level shared memory system
More informationLecture 2: CUDA Programming
CS 515 Programming Language and Compilers I Lecture 2: CUDA Programming Zheng (Eddy) Zhang Rutgers University Fall 2017, 9/12/2017 Review: Programming in CUDA Let s look at a sequential program in C first:
More informationLecture: Consistency Models, TM
Lecture: Consistency Models, TM Topics: consistency models, TM intro (Section 5.6) No class on Monday (please watch TM videos) Wednesday: TM wrap-up, interconnection networks 1 Coherence Vs. Consistency
More informationCS 153 Design of Operating Systems Winter 2016
CS 153 Design of Operating Systems Winter 2016 Lecture 7: Synchronization Administrivia Homework 1 Due today by the end of day Hopefully you have started on project 1 by now? Kernel-level threads (preemptable
More informationMicroarchitecture Overview. Performance
Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 15, 2007 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make
More informationTo Everyone... iii To Educators... v To Students... vi Acknowledgments... vii Final Words... ix References... x. 1 ADialogueontheBook 1
Contents To Everyone.............................. iii To Educators.............................. v To Students............................... vi Acknowledgments........................... vii Final Words..............................
More informationCIS 601 Graduate Seminar. Dr. Sunnie S. Chung Dhruv Patel ( ) Kalpesh Sharma ( )
Guide: CIS 601 Graduate Seminar Presented By: Dr. Sunnie S. Chung Dhruv Patel (2652790) Kalpesh Sharma (2660576) Introduction Background Parallel Data Warehouse (PDW) Hive MongoDB Client-side Shared SQL
More informationLab 2: Threads and Processes
CS333: Operating Systems Lab Lab 2: Threads and Processes Goal The goal of this lab is to get you comfortable with writing basic multi-process / multi-threaded applications, and understanding their performance.
More informationArrakis: The Operating System is the Control Plane
Arrakis: The Operating System is the Control Plane Simon Peter, Jialin Li, Irene Zhang, Dan Ports, Doug Woos, Arvind Krishnamurthy, Tom Anderson University of Washington Timothy Roscoe ETH Zurich Building
More informationZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS
ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS DANIEL SANCHEZ MIT CHRISTOS KOZYRAKIS STANFORD ISCA-40 JUNE 27, 2013 Introduction 2 Current detailed simulators are slow (~200
More informationSpin Locks and Contention. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit
Spin Locks and Contention Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Focus so far: Correctness and Progress Models Accurate (we never lied to you) But idealized
More informationAdvanced Parallel Programming I
Advanced Parallel Programming I Alexander Leutgeb, RISC Software GmbH RISC Software GmbH Johannes Kepler University Linz 2016 22.09.2016 1 Levels of Parallelism RISC Software GmbH Johannes Kepler University
More informationLearning with Purpose
Network Measurement for 100Gbps Links Using Multicore Processors Xiaoban Wu, Dr. Peilong Li, Dr. Yongyi Ran, Prof. Yan Luo Department of Electrical and Computer Engineering University of Massachusetts
More informationCS4021/4521 INTRODUCTION
CS4021/4521 Advanced Computer Architecture II Prof Jeremy Jones Rm 4.16 top floor South Leinster St (SLS) jones@scss.tcd.ie South Leinster St CS4021/4521 2018 jones@scss.tcd.ie School of Computer Science
More informationZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS
ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS DANIEL SANCHEZ MIT CHRISTOS KOZYRAKIS STANFORD ISCA-40 JUNE 27, 2013 Introduction 2 Current detailed simulators are slow (~200
More informationLecture 14: Multithreading
CS 152 Computer Architecture and Engineering Lecture 14: Multithreading John Wawrzynek Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~johnw
More informationSpin Locks and Contention. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit
Spin Locks and Contention Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Real world concurrency Understanding hardware architecture What is contention How to
More informationPerformance in the Multicore Era
Performance in the Multicore Era Gustavo Alonso Systems Group -- ETH Zurich, Switzerland Systems Group Enterprise Computing Center Performance in the multicore era 2 BACKGROUND - SWISSBOX SwissBox: An
More informationLatency-Tolerant Software Distributed Shared Memory
Latency-Tolerant Software Distributed Shared Memory Jacob Nelson, Brandon Holt, Brandon Myers, Preston Briggs, Luis Ceze, Simon Kahan, Mark Oskin University of Washington USENIX ATC 2015 July 9, 2015 25
More informationhow to implement any concurrent data structure
how to implement any concurrent data structure marcos k. aguilera vmware jointly with irina calciu siddhartha sen mahesh balakrishnan Where to find more information about this work How to Implement Any
More informationTransactifying Apache s Cache Module
H. Eran O. Lutzky Z. Guz I. Keidar Department of Electrical Engineering Technion Israel Institute of Technology SYSTOR 2009 The Israeli Experimental Systems Conference Outline 1 Why legacy applications
More informationl12 handout.txt l12 handout.txt Printed by Michael Walfish Feb 24, 11 12:57 Page 1/5 Feb 24, 11 12:57 Page 2/5
Feb 24, 11 12:57 Page 1/5 1 Handout for CS 372H 2 Class 12 3 24 February 2011 4 5 1. CAS / CMPXCHG 6 7 Useful operation: compare and swap, known as CAS. Says: "atomically 8 check whether a given memory
More informationKaisen Lin and Michael Conley
Kaisen Lin and Michael Conley Simultaneous Multithreading Instructions from multiple threads run simultaneously on superscalar processor More instruction fetching and register state Commercialized! DEC
More informationExploring mtcp based Single-Threaded and Multi-Threaded Web Server Design
Exploring mtcp based Single-Threaded and Multi-Threaded Web Server Design A Thesis Submitted in partial fulfillment of the requirements for the degree of Master of Technology by Pijush Chakraborty (153050015)
More informationLogTM: Log-Based Transactional Memory
LogTM: Log-Based Transactional Memory Kevin E. Moore, Jayaram Bobba, Michelle J. Moravan, Mark D. Hill, & David A. Wood 12th International Symposium on High Performance Computer Architecture () 26 Mulitfacet
More informationParallelization and Synchronization. CS165 Section 8
Parallelization and Synchronization CS165 Section 8 Multiprocessing & the Multicore Era Single-core performance stagnates (breakdown of Dennard scaling) Moore s law continues use additional transistors
More informationConcurrent Search Data Structures Can Be Blocking and Practically Wait-Free
Concurrent Search Data Structures Can Be Blocking and Practically Wait-Free Tudor David EPFL tudor.david@epfl.ch Rachid Guerraoui EPFL rachid.guerraoui@epfl.ch ARACT We argue that there is virtually no
More informationCPSC/ECE 3220 Fall 2017 Exam Give the definition (note: not the roles) for an operating system as stated in the textbook. (2 pts.
CPSC/ECE 3220 Fall 2017 Exam 1 Name: 1. Give the definition (note: not the roles) for an operating system as stated in the textbook. (2 pts.) Referee / Illusionist / Glue. Circle only one of R, I, or G.
More informationLecture 9: Multiprocessor OSs & Synchronization. CSC 469H1F Fall 2006 Angela Demke Brown
Lecture 9: Multiprocessor OSs & Synchronization CSC 469H1F Fall 2006 Angela Demke Brown The Problem Coordinated management of shared resources Resources may be accessed by multiple threads Need to control
More informationMULTI-THREADED QUERIES
15-721 Project 3 Final Presentation MULTI-THREADED QUERIES Wendong Li (wendongl) Lu Zhang (lzhang3) Rui Wang (ruiw1) Project Objective Intra-operator parallelism Use multiple threads in a single executor
More informationFalcon: Scaling IO Performance in Multi-SSD Volumes. The George Washington University
Falcon: Scaling IO Performance in Multi-SSD Volumes Pradeep Kumar H Howie Huang The George Washington University SSDs in Big Data Applications Recent trends advocate using many SSDs for higher throughput
More informationIntroduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1
Introduction to parallel computers and parallel programming Introduction to parallel computersand parallel programming p. 1 Content A quick overview of morden parallel hardware Parallelism within a chip
More informationOptimized Distributed Data Sharing Substrate in Multi-Core Commodity Clusters: A Comprehensive Study with Applications
Optimized Distributed Data Sharing Substrate in Multi-Core Commodity Clusters: A Comprehensive Study with Applications K. Vaidyanathan, P. Lai, S. Narravula and D. K. Panda Network Based Computing Laboratory
More informationHigh Performance Computing Lecture 21. Matthew Jacob Indian Institute of Science
High Performance Computing Lecture 21 Matthew Jacob Indian Institute of Science Semaphore Examples Semaphores can do more than mutex locks Example: Consider our concurrent program where process P1 reads
More informationTDDD56 Multicore and GPU computing Lab 2: Non-blocking data structures
TDDD56 Multicore and GPU computing Lab 2: Non-blocking data structures August Ernstsson, Nicolas Melot august.ernstsson@liu.se November 2, 2017 1 Introduction The protection of shared data structures against
More informationA Scalable Lock Manager for Multicores
A Scalable Lock Manager for Multicores Hyungsoo Jung Hyuck Han Alan Fekete NICTA Samsung Electronics University of Sydney Gernot Heiser NICTA Heon Y. Yeom Seoul National University @University of Sydney
More informationMartin Kruliš, v
Martin Kruliš 1 Optimizations in General Code And Compilation Memory Considerations Parallelism Profiling And Optimization Examples 2 Premature optimization is the root of all evil. -- D. Knuth Our goal
More information@ Massachusetts Institute of Technology All rights reserved.
..... CPHASH: A Cache-Partitioned Hash Table with LRU Eviction by Zviad Metreveli MASSACHUSETTS INSTITUTE OF TECHAC'LOGY JUN 2 1 2011 LIBRARIES Submitted to the Department of Electrical Engineering and
More informationDistributed Computing
HELLENIC REPUBLIC UNIVERSITY OF CRETE Distributed Computing Graduate Course Section 3: Spin Locks and Contention Panagiota Fatourou Department of Computer Science Spin Locks and Contention In contrast
More information