A Scalable Ordering Primitive for Multicore Machines. Sanidhya Kashyap Changwoo Min Kangnyeon Kim Taesoo Kim

Size: px
Start display at page:

Download "A Scalable Ordering Primitive for Multicore Machines. Sanidhya Kashyap Changwoo Min Kangnyeon Kim Taesoo Kim"

Transcription

1 A Scalable Ordering Primitive for Multicore Machines Sanidhya Kashyap Changwoo Min Kangnyeon Kim Taesoo Kim

2 Era of multicore machines 2

3 Scope of multicore machines Huge hardware thread parallelism How are operations executed correctly? Ordering Becomes scalability bottleneck... 3

4 Example: Read Log Update (RLU) Extension of RCU Modifes objects in a thread s local log Clock maintains correct snapshot (old vs new) Frees objects via epoch-based reclamation 4

5 Read Log Update (RLU) operation Global Clock (22) RLU header A A log/bufer to store copies (per-thread) B Log D B C D Read on start Q P Local Clock (22) E

6 RLU commit operation Global Clock (23) (22) 1. P updates clocks 2. P executes RCU-epoch Waits for Q to fnish B P Write Clock (23) ( ) D C Logical Clock maintains correctness/ordering A B C D E Maintained via atomic instructions FAA/CAS Q Local Clock (22) Q will read only old objects

7 Issue with logical clock RLU sufers from global clock contention Cache-line contention due to atomic instructions Possible to circumvent with our approach Phi 18 ARM 16 How can we achieve ordering with9 minimal timestamping overhead? 8 15 Ops/usec 6 Atomic 3 4 Ordo #cores #cores

8 Our proposed ordering primitive: Ordo Exposes a monotonically increasing clock Current hardware already provides rdtscp (X86), cntvct (ARM), stick (Sparc) Relies on a per-core invariant hardware clock Monotonically increases with constant skew regardless of dynamic frequency and voltage scaling 8

9 Challenges with Ordo Comparing two clocks Clocks are not synchronized Cores receive RESET signal at varying times Application: Modifying algorithms to use Ordo Able to compare between two timestamps 9

10 Embracing the invariant clocks Measure a global uncertainty window Ensure a new timestamp once a window is over Provides a notion of globally synchronized clock Measured ofset MUST have the invariant: Measured ofset is greater than the physical ofset Physical ofset: ofset due to RESET signal Measured ofset: physical ofset + one-way delay 1

11 Calculating global uncertainty window: ORDO_BOUNDARY Add one-way delay latency on each path C1 C1 C 2 2 C2 1) Calculate C1 timestamp 2) Notify C2 via memory 3) Get C2 timestamp T(C1): 4) Repeat steps 1-3 to get the minimum T(C2): 2 time 11

12 Calculating global uncertainty window: ORDO_BOUNDARY Add one-way delay latency on each path C1 C2 T(C2): 5 C2 C 1 3 Repeat prior steps in opposite direction Do not know which clock is ahead of the other T(C1): 8 time 12

13 Calculating global uncertainty window: ORDO_BOUNDARY Repeat steps for each pair of cores from C1 to Cn The maximum ofset is the ORDO_BOUNDARY C1 C 2 2 C2 C 1 3 C1 C2 C1 T(C2): 5 T(C1): C1 C2 3 C2 T(C2): 2 time T(C1): 8 13

14 Ordo application Applicable to any timestamp-based algorithm Expose Ordo API for these algorithms get_time(): Current hardware timestamp cmp_time(t1, t2): Compare two timestamps with uncertainty, if t1-t2 < ORDO_BOUNDARY new_time(t): Return tnew > (t + ORDO_BOUNDARY) Catch: Algorithms should handle uncertainty 14

15 Algorithms with Ordo handling uncertainty Physical to logical timestamping: Rely on cmp_time() to compare two timestamps Either defer or revert if comparison is uncertain Use new_time() to guarantee new time Physical timestamping: Use new_time() to access the global clock 15

16 Read Log Update (RLU Ordo Global ofset (3) A P B Log D B C D Read on start Q ) operation Local Clock (5) Q s core clock (5) P s local clock (22) E

17 RLU Global ofset (3) Ordo commit operation 1. P updates own clock 2. P executes RCU-epoch Waits for Q to fnish B A B P Write Clock (15) ( ) D C C D E Q Local Clock (5) Q will read only old objects

18 Algorithms modifed with Ordo RLU See er p a p our Transactional Locking (TL2) in STM Database concurrency control: OCC, MVCC Oplog used in Linux forking functionality 18

19 Evaluation Questions: Measured global ofset (ORDO_BOUNDARY) Maximum scalability of Ordo Ordo s impact on algorithms Machines confguration: 24 core, 8 socket Intel Xeon machine (Xeon) 256 core, Intel Xeon Phi (Phi) 96 core, 2 socket ARM machine (ARM) 32 core, 8 socket AMD machine (AMD) 19

20 Ofset between clocks Empirically measured ofset after reboots ORDO_BOUNDARY is the maximum ofset Machine Minimum (ns) Maximum (ns) Intel Xeon Intel Xeon phi 9 27 ARM 1 1,1 AMD

21 Timestamping with Ordo Ordo relies on hardware timestamping x faster than atomic increments Ops/usec/core 12 Xeon(Atomic) ARM(Atomic) 8 ARM(Ordo) Phi(Atomic) 8 Xeon(Ordo) 4 12 Ops/usec/core #core Phi(Ordo) AMD(Atomic) AMD(Ordo) #core 21

22 Scaling RLU with Ordo RLUOrdo is 2.1x faster on an average Still sufers from object copy and its locking RLU 2% Ops/usec Xeon Ops/usec 12 4 RLU(Ordo) 2% ARM Phi AMD #core 8 16 #core 22

23 Discussion and limitations Simplifes the design and understanding of algorithms Not a panacea Applicable when clock is contentious No skew consideration Thread ID-based timestamp comparison has its limitation 23

24 Conclusion Ordo is a scalable timestamping primitive Relies on invariant hardware clocks Exposes time-based API to the user Applied Ordo to fve concurrent algorithms Improves the scalability of algorithms by at most 39.7x across architectures 24

25 Backup Slides

26 Ofset between clocks Clocks are not synchronized 8th socket in Xeon and 2nd socket in ARM Results remain consistent even after reboots and measuring after a period of time Xeon # core Arm Ofset between clocks # core

27 Sensitivity of ORDO_BOUNDARY Varying ORDO_BOUNDARY from 1/8x 8x Cycles increases from K on Xeon machine Normalized throughput core 1-socket 8-sockets 27

28 Physical timestamping: Oplog Improves Exim performance by 1.9x at 24 cores 12k 1k Messages/sec Stock Oplog(Ordo) 8k 6k 4k 2k k #core

29 Scaling database concurrency control Improves OCC and MVCC by x for read-only (YCSB) OCCOrdo 1.24x faster than Tictoc and Silo (TPC-C) OCC OCC (Ordo) Xeon Txns/usec Txns/usec MVCC MVCC (Ordo) Phi ARM # core AMD 8 16 # core

30 Cannot use clock synchronization protocols No information on minimum bounds on message delivery between/among clocks Protocols introduce various errors Can lead to mis-synchronized clocks Larger or smaller than the actual physical ofset Lead to incorrect implementation of concurrent algorithms 3

Rethinking Serializable Multi-version Concurrency Control. Jose Faleiro and Daniel Abadi Yale University

Rethinking Serializable Multi-version Concurrency Control. Jose Faleiro and Daniel Abadi Yale University Rethinking Serializable Multi-version Concurrency Control Jose Faleiro and Daniel Abadi Yale University Theory: Single- vs Multi-version Systems Single-version system T r Read X X 0 T w Write X Multi-version

More information

TicToc: Time Traveling Optimistic Concurrency Control

TicToc: Time Traveling Optimistic Concurrency Control TicToc: Time Traveling Optimistic Concurrency Control Authors: Xiangyao Yu, Andrew Pavlo, Daniel Sanchez, Srinivas Devadas Presented By: Shreejit Nair 1 Background: Optimistic Concurrency Control ØRead

More information

TIME TRAVELING HARDWARE AND SOFTWARE SYSTEMS. Xiangyao Yu, Srini Devadas CSAIL, MIT

TIME TRAVELING HARDWARE AND SOFTWARE SYSTEMS. Xiangyao Yu, Srini Devadas CSAIL, MIT TIME TRAVELING HARDWARE AND SOFTWARE SYSTEMS Xiangyao Yu, Srini Devadas CSAIL, MIT FOR FIFTY YEARS, WE HAVE RIDDEN MOORE S LAW Moore s Law and the scaling of clock frequency = printing press for the currency

More information

Exploring mtcp based Single-Threaded and Multi-Threaded Web Server Design

Exploring mtcp based Single-Threaded and Multi-Threaded Web Server Design Exploring mtcp based Single-Threaded and Multi-Threaded Web Server Design A Thesis Submitted in partial fulfillment of the requirements for the degree of Master of Technology by Pijush Chakraborty (153050015)

More information

High Performance Transactions in Deuteronomy

High Performance Transactions in Deuteronomy High Performance Transactions in Deuteronomy Justin Levandoski, David Lomet, Sudipta Sengupta, Ryan Stutsman, and Rui Wang Microsoft Research Overview Deuteronomy: componentized DB stack Separates transaction,

More information

Understanding Manycore Scalability of File Systems. Changwoo Min, Sanidhya Kashyap, Steffen Maass Woonhak Kang, and Taesoo Kim

Understanding Manycore Scalability of File Systems. Changwoo Min, Sanidhya Kashyap, Steffen Maass Woonhak Kang, and Taesoo Kim Understanding Manycore Scalability of File Systems Changwoo Min, Sanidhya Kashyap, Steffen Maass Woonhak Kang, and Taesoo Kim Application must parallelize I/O operations Death of single core CPU scaling

More information

Kernel Synchronization I. Changwoo Min

Kernel Synchronization I. Changwoo Min 1 Kernel Synchronization I Changwoo Min 2 Summary of last lectures Tools: building, exploring, and debugging Linux kernel Core kernel infrastructure syscall, module, kernel data structures Process management

More information

Agenda. Designing Transactional Memory Systems. Why not obstruction-free? Why lock-based?

Agenda. Designing Transactional Memory Systems. Why not obstruction-free? Why lock-based? Agenda Designing Transactional Memory Systems Part III: Lock-based STMs Pascal Felber University of Neuchatel Pascal.Felber@unine.ch Part I: Introduction Part II: Obstruction-free STMs Part III: Lock-based

More information

OpLog: a library for scaling update-heavy data structures

OpLog: a library for scaling update-heavy data structures OpLog: a library for scaling update-heavy data structures Silas Boyd-Wickizer, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich MIT CSAIL Abstract Existing techniques (e.g., RCU) can achieve good

More information

Leveraging Lock Contention to Improve Transaction Applications. University of Washington

Leveraging Lock Contention to Improve Transaction Applications. University of Washington Leveraging Lock Contention to Improve Transaction Applications Cong Yan Alvin Cheung University of Washington 1 Background Database transactions Airline ticket reservation, banking, online shopping...

More information

Stephen Tu, Wenting Zheng, Eddie Kohler, Barbara Liskov, Samuel Madden Presenter : Akshada Kulkarni Acknowledgement : Author s slides are used with

Stephen Tu, Wenting Zheng, Eddie Kohler, Barbara Liskov, Samuel Madden Presenter : Akshada Kulkarni Acknowledgement : Author s slides are used with Stephen Tu, Wenting Zheng, Eddie Kohler, Barbara Liskov, Samuel Madden Presenter : Akshada Kulkarni Acknowledgement : Author s slides are used with some additions/ modifications 1 Introduction Design Evaluation

More information

Big and Fast. Anti-Caching in OLTP Systems. Justin DeBrabant

Big and Fast. Anti-Caching in OLTP Systems. Justin DeBrabant Big and Fast Anti-Caching in OLTP Systems Justin DeBrabant Online Transaction Processing transaction-oriented small footprint write-intensive 2 A bit of history 3 OLTP Through the Years relational model

More information

Cost of Concurrency in Hybrid Transactional Memory. Trevor Brown (University of Toronto) Srivatsan Ravi (Purdue University)

Cost of Concurrency in Hybrid Transactional Memory. Trevor Brown (University of Toronto) Srivatsan Ravi (Purdue University) Cost of Concurrency in Hybrid Transactional Memory Trevor Brown (University of Toronto) Srivatsan Ravi (Purdue University) 1 Transactional Memory: a history Hardware TM Software TM Hybrid TM 1993 1995-today

More information

FOEDUS: OLTP Engine for a Thousand Cores and NVRAM

FOEDUS: OLTP Engine for a Thousand Cores and NVRAM FOEDUS: OLTP Engine for a Thousand Cores and NVRAM Hideaki Kimura HP Labs, Palo Alto, CA Slides By : Hideaki Kimura Presented By : Aman Preet Singh Next-Generation Server Hardware? HP The Machine UC Berkeley

More information

Concurrent Data Structures Concurrent Algorithms 2016

Concurrent Data Structures Concurrent Algorithms 2016 Concurrent Data Structures Concurrent Algorithms 2016 Tudor David (based on slides by Vasileios Trigonakis) Tudor David 11.2016 1 Data Structures (DSs) Constructs for efficiently storing and retrieving

More information

Solros: A Data-Centric Operating System Architecture for Heterogeneous Computing

Solros: A Data-Centric Operating System Architecture for Heterogeneous Computing Solros: A Data-Centric Operating System Architecture for Heterogeneous Computing Changwoo Min, Woonhak Kang, Mohan Kumar, Sanidhya Kashyap, Steffen Maass, Heeseung Jo, Taesoo Kim Virginia Tech, ebay, Georgia

More information

Cicada: Dependably Fast Multi-Core In-Memory Transactions

Cicada: Dependably Fast Multi-Core In-Memory Transactions Cicada: Dependably Fast Multi-Core In-Memory Transactions Hyeontaek Lim Carnegie Mellon University hl@cs.cmu.edu Michael Kaminsky Intel Labs michael.e.kaminsky@intel.com ABSTRACT dga@cs.cmu.edu able timestamp

More information

Parallel Algorithm Engineering

Parallel Algorithm Engineering Parallel Algorithm Engineering Kenneth S. Bøgh PhD Fellow Based on slides by Darius Sidlauskas Outline Background Current multicore architectures UMA vs NUMA The openmp framework and numa control Examples

More information

Software transactional memory

Software transactional memory Transactional locking II (Dice et. al, DISC'06) Time-based STM (Felber et. al, TPDS'08) Mentor: Johannes Schneider March 16 th, 2011 Motivation Multiprocessor systems Speed up time-sharing applications

More information

Building Consistent Transactions with Inconsistent Replication

Building Consistent Transactions with Inconsistent Replication Building Consistent Transactions with Inconsistent Replication Irene Zhang, Naveen Kr. Sharma, Adriana Szekeres, Arvind Krishnamurthy, Dan R. K. Ports University of Washington Distributed storage systems

More information

Parallelism Marco Serafini

Parallelism Marco Serafini Parallelism Marco Serafini COMPSCI 590S Lecture 3 Announcements Reviews First paper posted on website Review due by this Wednesday 11 PM (hard deadline) Data Science Career Mixer (save the date!) November

More information

Percolator. Large-Scale Incremental Processing using Distributed Transactions and Notifications. D. Peng & F. Dabek

Percolator. Large-Scale Incremental Processing using Distributed Transactions and Notifications. D. Peng & F. Dabek Percolator Large-Scale Incremental Processing using Distributed Transactions and Notifications D. Peng & F. Dabek Motivation Built to maintain the Google web search index Need to maintain a large repository,

More information

Scalable Concurrent Hash Tables via Relativistic Programming

Scalable Concurrent Hash Tables via Relativistic Programming Scalable Concurrent Hash Tables via Relativistic Programming Josh Triplett September 24, 2009 Speed of data < Speed of light Speed of light: 3e8 meters/second Processor speed: 3 GHz, 3e9 cycles/second

More information

Lecture X: Transactions

Lecture X: Transactions Lecture X: Transactions CMPT 401 Summer 2007 Dr. Alexandra Fedorova Transactions A transaction is a collection of actions logically belonging together To the outside world, a transaction must appear as

More information

Pay Migration Tax to Homeland: Anchor-based Scalable Reference Counting for Multicores

Pay Migration Tax to Homeland: Anchor-based Scalable Reference Counting for Multicores Pay Migration Tax to Homeland: Anchor-based Scalable Reference Counting for Multicores Seokyong Jung, Jongbin Kim, Minsoo Ryu, Sooyong Kang, Hyungsoo Jung Hanyang University 1 Reference counting It is

More information

Chapter 4: Threads. Chapter 4: Threads. Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading Issues

Chapter 4: Threads. Chapter 4: Threads. Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading Issues Chapter 4: Threads Silberschatz, Galvin and Gagne 2013 Chapter 4: Threads Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading Issues 4.2 Silberschatz, Galvin

More information

Taking a Virtual Machine Towards Many-Cores. Rickard Green - Patrik Nyblom -

Taking a Virtual Machine Towards Many-Cores. Rickard Green - Patrik Nyblom - Taking a Virtual Machine Towards Many-Cores Rickard Green - rickard@erlang.org Patrik Nyblom - pan@erlang.org What we all know by now Number of cores per processor AMD Opteron Intel Xeon Sparc Niagara

More information

An Evaluation of Distributed Concurrency Control. Harding, Aken, Pavlo and Stonebraker Presented by: Thamir Qadah For CS590-BDS

An Evaluation of Distributed Concurrency Control. Harding, Aken, Pavlo and Stonebraker Presented by: Thamir Qadah For CS590-BDS An Evaluation of Distributed Concurrency Control Harding, Aken, Pavlo and Stonebraker Presented by: Thamir Qadah For CS590-BDS 1 Outline Motivation System Architecture Implemented Distributed CC protocols

More information

Building Consistent Transactions with Inconsistent Replication

Building Consistent Transactions with Inconsistent Replication DB Reading Group Fall 2015 slides by Dana Van Aken Building Consistent Transactions with Inconsistent Replication Irene Zhang, Naveen Kr. Sharma, Adriana Szekeres, Arvind Krishnamurthy, Dan R. K. Ports

More information

The Flooding Time Synchronization Protocol

The Flooding Time Synchronization Protocol The Flooding Time Synchronization Protocol Miklos Maroti, Branislav Kusy, Gyula Simon and Akos Ledeczi Vanderbilt University Contributions Better understanding of the uncertainties of radio message delivery

More information

Parallel and Distributed Systems. Programming Models. Why Parallel or Distributed Computing? What is a parallel computer?

Parallel and Distributed Systems. Programming Models. Why Parallel or Distributed Computing? What is a parallel computer? Parallel and Distributed Systems Instructor: Sandhya Dwarkadas Department of Computer Science University of Rochester What is a parallel computer? A collection of processing elements that communicate and

More information

High-Performance ACID via Modular Concurrency Control

High-Performance ACID via Modular Concurrency Control FALL 2015 High-Performance ACID via Modular Concurrency Control Chao Xie 1, Chunzhi Su 1, Cody Littley 1, Lorenzo Alvisi 1, Manos Kapritsos 2, Yang Wang 3 (slides by Mrigesh) TODAY S READING Background

More information

OPERATING SYSTEM. Chapter 4: Threads

OPERATING SYSTEM. Chapter 4: Threads OPERATING SYSTEM Chapter 4: Threads Chapter 4: Threads Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading Issues Operating System Examples Objectives To

More information

ffwd: delegation is (much) faster than you think Sepideh Roghanchi, Jakob Eriksson, Nilanjana Basu

ffwd: delegation is (much) faster than you think Sepideh Roghanchi, Jakob Eriksson, Nilanjana Basu ffwd: delegation is (much) faster than you think Sepideh Roghanchi, Jakob Eriksson, Nilanjana Basu int get_seqno() { } return ++seqno; // ~1 Billion ops/s // single-threaded int threadsafe_get_seqno()

More information

Chapter 4: Threads. Operating System Concepts. Silberschatz, Galvin and Gagne

Chapter 4: Threads. Operating System Concepts. Silberschatz, Galvin and Gagne Chapter 4: Threads Silberschatz, Galvin and Gagne Chapter 4: Threads Overview Multithreading Models Thread Libraries Threading Issues Operating System Examples Linux Threads 4.2 Silberschatz, Galvin and

More information

Scaling Optimistic Concurrency Control by Approximately Partitioning the Certifier and Log

Scaling Optimistic Concurrency Control by Approximately Partitioning the Certifier and Log Scaling Optimistic Concurrency Control by Approximately Partitioning the Certifier and Log Philip A. Bernstein Microsoft Research Redmond, WA, USA phil.bernstein@microsoft.com Sudipto Das Microsoft Research

More information

An Introduction to Parallel Programming

An Introduction to Parallel Programming An Introduction to Parallel Programming Ing. Andrea Marongiu (a.marongiu@unibo.it) Includes slides from Multicore Programming Primer course at Massachusetts Institute of Technology (MIT) by Prof. SamanAmarasinghe

More information

Mosaic: Processing a Trillion-Edge Graph on a Single Machine

Mosaic: Processing a Trillion-Edge Graph on a Single Machine Mosaic: Processing a Trillion-Edge Graph on a Single Machine Steffen Maass, Changwoo Min, Sanidhya Kashyap, Woonhak Kang, Mohan Kumar, Taesoo Kim Georgia Institute of Technology Best Student Paper @ EuroSys

More information

Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution

Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution Ravi Rajwar and Jim Goodman University of Wisconsin-Madison International Symposium on Microarchitecture, Dec. 2001 Funding

More information

Benchmark Performance Results for Pervasive PSQL v11. A Pervasive PSQL White Paper September 2010

Benchmark Performance Results for Pervasive PSQL v11. A Pervasive PSQL White Paper September 2010 Benchmark Performance Results for Pervasive PSQL v11 A Pervasive PSQL White Paper September 2010 Table of Contents Executive Summary... 3 Impact Of New Hardware Architecture On Applications... 3 The Design

More information

Big Data Technology Incremental Processing using Distributed Transactions

Big Data Technology Incremental Processing using Distributed Transactions Big Data Technology Incremental Processing using Distributed Transactions Eshcar Hillel Yahoo! Ronny Lempel Outbrain *Based on slides by Edward Bortnikov and Ohad Shacham Roadmap Previous classes Stream

More information

Leveraging Lock Contention to Improve Transaction Applications. University of Washington

Leveraging Lock Contention to Improve Transaction Applications. University of Washington Leveraging Lock Contention to Improve Transaction Applications Cong Yan Alvin Cheung University of Washington Background Pessimistic locking-based CC protocols Poor performance under high data contention

More information

Hardware Support for NVM Programming

Hardware Support for NVM Programming Hardware Support for NVM Programming 1 Outline Ordering Transactions Write endurance 2 Volatile Memory Ordering Write-back caching Improves performance Reorders writes to DRAM STORE A STORE B CPU CPU B

More information

Curriculum 2013 Knowledge Units Pertaining to PDC

Curriculum 2013 Knowledge Units Pertaining to PDC Curriculum 2013 Knowledge Units Pertaining to C KA KU Tier Level NumC Learning Outcome Assembly level machine Describe how an instruction is executed in a classical von Neumann machine, with organization

More information

Scaling PostgreSQL on SMP Architectures

Scaling PostgreSQL on SMP Architectures Scaling PostgreSQL on SMP Architectures Doug Tolbert, David Strong, Johney Tsai {doug.tolbert, david.strong, johney.tsai}@unisys.com PGCon 2007, Ottawa, May 21-24, 2007 Page 1 Performance vs. Scalability

More information

Consistency & Replication

Consistency & Replication Objectives Consistency & Replication Instructor: Dr. Tongping Liu To understand replication and related issues in distributed systems" To learn about how to keep multiple replicas consistent with each

More information

Netconf RCU and Breakage. Paul E. McKenney IBM Distinguished Engineer & CTO Linux Linux Technology Center IBM Corporation

Netconf RCU and Breakage. Paul E. McKenney IBM Distinguished Engineer & CTO Linux Linux Technology Center IBM Corporation RCU and Breakage Paul E. McKenney IBM Distinguished Engineer & CTO Linux Linux Technology Center Copyright 2009 IBM 2002 IBM Corporation Overview What the #$I#@(&!!! is RCU-bh for??? RCU status in mainline

More information

Silas Boyd-Wickizer. Doctor of Philosophy. at the. February 2014

Silas Boyd-Wickizer. Doctor of Philosophy. at the. February 2014 Optimizing Communication Bottlenecks in Multiprocessor Operating System Kernels by Silas Boyd-Wickizer B.S., Stanford University (2004) M.S., Stanford University (2006) Submitted to the Department of Electrical

More information

Chapter 4: Threads. Overview Multithreading Models Thread Libraries Threading Issues Operating System Examples Windows XP Threads Linux Threads

Chapter 4: Threads. Overview Multithreading Models Thread Libraries Threading Issues Operating System Examples Windows XP Threads Linux Threads Chapter 4: Threads Overview Multithreading Models Thread Libraries Threading Issues Operating System Examples Windows XP Threads Linux Threads Chapter 4: Threads Objectives To introduce the notion of a

More information

Utilizing the IOMMU scalably

Utilizing the IOMMU scalably Utilizing the IOMMU scalably Omer Peleg, Adam Morrison, Benjamin Serebrin, and Dan Tsafrir USENIX ATC 15 2017711456 Shin Seok Ha 1 Introduction What is an IOMMU? Provides the translation between IO addresses

More information

What's new in MySQL 5.5? Performance/Scale Unleashed

What's new in MySQL 5.5? Performance/Scale Unleashed What's new in MySQL 5.5? Performance/Scale Unleashed Mikael Ronström Senior MySQL Architect The preceding is intended to outline our general product direction. It is intended for

More information

Spanner: Google's Globally-Distributed Database. Presented by Maciej Swiech

Spanner: Google's Globally-Distributed Database. Presented by Maciej Swiech Spanner: Google's Globally-Distributed Database Presented by Maciej Swiech What is Spanner? "...Google's scalable, multi-version, globallydistributed, and synchronously replicated database." What is Spanner?

More information

Read-Copy Update in a Garbage Collected Environment. By Harshal Sheth, Aashish Welling, and Nihar Sheth

Read-Copy Update in a Garbage Collected Environment. By Harshal Sheth, Aashish Welling, and Nihar Sheth Read-Copy Update in a Garbage Collected Environment By Harshal Sheth, Aashish Welling, and Nihar Sheth Overview Read-copy update (RCU) Synchronization mechanism used in the Linux kernel Mainly used in

More information

QuartzV: Bringing Quality of Time to Virtual Machines

QuartzV: Bringing Quality of Time to Virtual Machines QuartzV: Bringing Quality of Time to Virtual Machines Sandeep D souza and Raj Rajkumar Carnegie Mellon University IEEE RTAS @ CPS Week 2018 1 A Shared Notion of Time Coordinated Actions Ordering of Events

More information

Exploiting Hardware Transactional Memory for Efficient In-Memory Transaction Processing. Hao Qian, Zhaoguo Wang, Haibing Guan, Binyu Zang, Haibo Chen

Exploiting Hardware Transactional Memory for Efficient In-Memory Transaction Processing. Hao Qian, Zhaoguo Wang, Haibing Guan, Binyu Zang, Haibo Chen Exploiting Hardware Transactional Memory for Efficient In-Memory Transaction Processing Hao Qian, Zhaoguo Wang, Haibing Guan, Binyu Zang, Haibo Chen Shanghai Key Laboratory of Scalable Computing and Systems

More information

ASPERA HIGH-SPEED TRANSFER. Moving the world s data at maximum speed

ASPERA HIGH-SPEED TRANSFER. Moving the world s data at maximum speed ASPERA HIGH-SPEED TRANSFER Moving the world s data at maximum speed ASPERA HIGH-SPEED FILE TRANSFER 80 GBIT/S OVER IP USING DPDK Performance, Code, and Architecture Charles Shiflett Developer of next-generation

More information

Improving STM Performance with Transactional Structs 1

Improving STM Performance with Transactional Structs 1 Improving STM Performance with Transactional Structs 1 Ryan Yates and Michael L. Scott University of Rochester IFL, 8-31-2016 1 This work was funded in part by the National Science Foundation under grants

More information

PushyDB. Jeff Chan, Kenny Lam, Nils Molina, Oliver Song {jeffchan, kennylam, molina,

PushyDB. Jeff Chan, Kenny Lam, Nils Molina, Oliver Song {jeffchan, kennylam, molina, PushyDB Jeff Chan, Kenny Lam, Nils Molina, Oliver Song {jeffchan, kennylam, molina, osong}@mit.edu https://github.com/jeffchan/6.824 1. Abstract PushyDB provides a more fully featured database that exposes

More information

Scalable Locking. Adam Belay

Scalable Locking. Adam Belay Scalable Locking Adam Belay Problem: Locks can ruin performance 12 finds/sec 9 6 Locking overhead dominates 3 0 0 6 12 18 24 30 36 42 48 Cores Problem: Locks can ruin performance the locks

More information

StreamBox: Modern Stream Processing on a Multicore Machine

StreamBox: Modern Stream Processing on a Multicore Machine StreamBox: Modern Stream Processing on a Multicore Machine Hongyu Miao and Heejin Park, Purdue ECE; Myeongjae Jeon and Gennady Pekhimenko, Microsoft Research; Kathryn S. McKinley, Google; Felix Xiaozhu

More information

Closing the Performance Gap Between Volatile and Persistent K-V Stores

Closing the Performance Gap Between Volatile and Persistent K-V Stores Closing the Performance Gap Between Volatile and Persistent K-V Stores Yihe Huang, Harvard University Matej Pavlovic, EPFL Virendra Marathe, Oracle Labs Margo Seltzer, Oracle Labs Tim Harris, Oracle Labs

More information

Chapter 4: Multithreaded Programming. Operating System Concepts 8 th Edition,

Chapter 4: Multithreaded Programming. Operating System Concepts 8 th Edition, Chapter 4: Multithreaded Programming, Silberschatz, Galvin and Gagne 2009 Chapter 4: Multithreaded Programming Overview Multithreading Models Thread Libraries Threading Issues 4.2 Silberschatz, Galvin

More information

Lock vs. Lock-free Memory Project proposal

Lock vs. Lock-free Memory Project proposal Lock vs. Lock-free Memory Project proposal Fahad Alduraibi Aws Ahmad Eman Elrifaei Electrical and Computer Engineering Southern Illinois University 1. Introduction The CPU performance development history

More information

Relaxed Memory Consistency

Relaxed Memory Consistency Relaxed Memory Consistency Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu SSE3054: Multicore Systems, Spring 2017, Jinkyu Jeong (jinkyu@skku.edu)

More information

CPSC/ECE 3220 Fall 2017 Exam Give the definition (note: not the roles) for an operating system as stated in the textbook. (2 pts.

CPSC/ECE 3220 Fall 2017 Exam Give the definition (note: not the roles) for an operating system as stated in the textbook. (2 pts. CPSC/ECE 3220 Fall 2017 Exam 1 Name: 1. Give the definition (note: not the roles) for an operating system as stated in the textbook. (2 pts.) Referee / Illusionist / Glue. Circle only one of R, I, or G.

More information

TicToc: Time Traveling Optimistic Concurrency Control

TicToc: Time Traveling Optimistic Concurrency Control TicToc: Time Traveling Optimistic Concurrency Control Xiangyao Yu Andrew Pavlo Daniel Sanchez Srinivas Devadas CSAIL MIT Carnegie Mellon University CSAIL MIT CSAIL MIT yy@csail.mit.edu pavlo@cs.cmu.edu

More information

No compromises: distributed transactions with consistency, availability, and performance

No compromises: distributed transactions with consistency, availability, and performance No compromises: distributed transactions with consistency, availability, and performance Aleksandar Dragojevi c, Dushyanth Narayanan, Edmund B. Nightingale, Matthew Renzelmann, Alex Shamis, Anirudh Badam,

More information

COMP 633 Parallel Computing.

COMP 633 Parallel Computing. COMP 633 Parallel Computing http://www.cs.unc.edu/~prins/classes/633/ Parallel computing What is it? multiple processors cooperating to solve a single problem hopefully faster than a single processor!

More information

Time-based Transactional Memory with Scalable Time Bases

Time-based Transactional Memory with Scalable Time Bases Time-based Transactional Memory with Scalable Time Bases Torvald Riegel Dresden University of Technology, Germany torvald.riegel@inf.tudresden.de Christof Fetzer Dresden University of Technology, Germany

More information

Parallel Programming Multicore systems

Parallel Programming Multicore systems FYS3240 PC-based instrumentation and microcontrollers Parallel Programming Multicore systems Spring 2011 Lecture #9 Bekkeng, 4.4.2011 Introduction Until recently, innovations in processor technology have

More information

Multiprocessor Cache Coherence. Chapter 5. Memory System is Coherent If... From ILP to TLP. Enforcing Cache Coherence. Multiprocessor Types

Multiprocessor Cache Coherence. Chapter 5. Memory System is Coherent If... From ILP to TLP. Enforcing Cache Coherence. Multiprocessor Types Chapter 5 Multiprocessor Cache Coherence Thread-Level Parallelism 1: read 2: read 3: write??? 1 4 From ILP to TLP Memory System is Coherent If... ILP became inefficient in terms of Power consumption Silicon

More information

Real Time Linux patches: history and usage

Real Time Linux patches: history and usage Real Time Linux patches: history and usage Presentation first given at: FOSDEM 2006 Embedded Development Room See www.fosdem.org Klaas van Gend Field Application Engineer for Europe Why Linux in Real-Time

More information

High-Performance Transaction Processing in Journaling File Systems Y. Son, S. Kim, H. Y. Yeom, and H. Han

High-Performance Transaction Processing in Journaling File Systems Y. Son, S. Kim, H. Y. Yeom, and H. Han High-Performance Transaction Processing in Journaling File Systems Y. Son, S. Kim, H. Y. Yeom, and H. Han Seoul National University, Korea Dongduk Women s University, Korea Contents Motivation and Background

More information

PrimeBase XT. A transactional engine for MySQL. Paul McCullagh SNAP Innovation GmbH

PrimeBase XT. A transactional engine for MySQL.  Paul McCullagh SNAP Innovation GmbH PrimeBase XT A transactional engine for MySQL Paul McCullagh SNAP Innovation GmbH Our Company SNAP Innovation GmbH was founded in 1996, currently 25 employees. Purpose: develop and support PrimeBase database,

More information

Main-Memory Databases 1 / 25

Main-Memory Databases 1 / 25 1 / 25 Motivation Hardware trends Huge main memory capacity with complex access characteristics (Caches, NUMA) Many-core CPUs SIMD support in CPUs New CPU features (HTM) Also: Graphic cards, FPGAs, low

More information

Chapter 4: Threads. Chapter 4: Threads

Chapter 4: Threads. Chapter 4: Threads Chapter 4: Threads Silberschatz, Galvin and Gagne 2013 Chapter 4: Threads Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading Issues Operating System Examples

More information

Last Class Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications

Last Class Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications Last Class Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 - DB Applications C. Faloutsos A. Pavlo Lecture#23: Concurrency Control Part 2 (R&G ch. 17) Serializability Two-Phase Locking Deadlocks

More information

CS3210: Crash consistency

CS3210: Crash consistency CS3210: Crash consistency Kyle Harrigan 1 / 45 Administrivia Lab 4 Part C Due Tomorrow Quiz #2. Lab3-4, Ch 3-6 (read "xv6 book") Open laptop/book, no Internet 2 / 45 Summary of cs3210 Power-on -> BIOS

More information

JavaScript Zero. Real JavaScript and Zero Side-Channel Attacks. Michael Schwarz, Moritz Lipp, Daniel Gruss

JavaScript Zero. Real JavaScript and Zero Side-Channel Attacks. Michael Schwarz, Moritz Lipp, Daniel Gruss JavaScript Zero Real JavaScript and Zero Side-Channel Attacks Michael Schwarz, Moritz Lipp, Daniel Gruss 20.02.2018 www.iaik.tugraz.at 1 Michael Schwarz, Moritz Lipp, Daniel Gruss www.iaik.tugraz.at Outline

More information

A Scalable Lock Manager for Multicores

A Scalable Lock Manager for Multicores A Scalable Lock Manager for Multicores Hyungsoo Jung Hyuck Han Alan Fekete NICTA Samsung Electronics University of Sydney Gernot Heiser NICTA Heon Y. Yeom Seoul National University @University of Sydney

More information

Flexible Architecture Research Machine (FARM)

Flexible Architecture Research Machine (FARM) Flexible Architecture Research Machine (FARM) RAMP Retreat June 25, 2009 Jared Casper, Tayo Oguntebi, Sungpack Hong, Nathan Bronson Christos Kozyrakis, Kunle Olukotun Motivation Why CPUs + FPGAs make sense

More information

Multi-core Architecture and Programming

Multi-core Architecture and Programming Multi-core Architecture and Programming Yang Quansheng( 杨全胜 ) http://www.njyangqs.com School of Computer Science & Engineering 1 http://www.njyangqs.com Process, threads and Parallel Programming Content

More information

The Case for Heterogeneous HTAP

The Case for Heterogeneous HTAP The Case for Heterogeneous HTAP Raja Appuswamy, Manos Karpathiotakis, Danica Porobic, and Anastasia Ailamaki Data-Intensive Applications and Systems Lab EPFL 1 HTAP the contract with the hardware Hybrid

More information

CMSC Computer Architecture Lecture 15: Memory Consistency and Synchronization. Prof. Yanjing Li University of Chicago

CMSC Computer Architecture Lecture 15: Memory Consistency and Synchronization. Prof. Yanjing Li University of Chicago CMSC 22200 Computer Architecture Lecture 15: Memory Consistency and Synchronization Prof. Yanjing Li University of Chicago Administrative Stuff! Lab 5 (multi-core) " Basic requirements: out later today

More information

Revisiting the Past 25 Years: Lessons for the Future. Guri Sohi University of Wisconsin-Madison

Revisiting the Past 25 Years: Lessons for the Future. Guri Sohi University of Wisconsin-Madison Revisiting the Past 25 Years: Lessons for the Future Guri Sohi University of Wisconsin-Madison Outline VLIW OOO Superscalar Enhancing Superscalar And the future 2 Beyond pipelining to ILP Late 1980s to

More information

THE UNIVERSITY OF CHICAGO TOWARD COORDINATION-FREE AND RECONFIGURABLE MIXED CONCURRENCY CONTROL A DISSERTATION SUBMITTED TO

THE UNIVERSITY OF CHICAGO TOWARD COORDINATION-FREE AND RECONFIGURABLE MIXED CONCURRENCY CONTROL A DISSERTATION SUBMITTED TO THE UNIVERSITY OF CHICAGO TOWARD COORDINATION-FREE AND RECONFIGURABLE MIXED CONCURRENCY CONTROL A DISSERTATION SUBMITTED TO THE FACULTY OF THE DIVISION OF THE PHYSICAL SCIENCE IN CANDIDACY FOR THE DEGREE

More information

How Scalable is your SMB?

How Scalable is your SMB? How Scalable is your SMB? Mark Rabinovich Visuality Systems Ltd. What is this all about? Visuality Systems Ltd. provides SMB solutions from 1998. NQE (Embedded) is an implementation of SMB client/server

More information

DIAMOND RINGS ACKNOWLEDGED EVENT PROPAGATION IN MANY-CORE PROCESSORS

DIAMOND RINGS ACKNOWLEDGED EVENT PROPAGATION IN MANY-CORE PROCESSORS th August DIAMOND RINGS ACKNOWLEDGED EVENT PROPAGATION IN MANY-CORE PROCESSORS Stefan Nürnberger, Randolf Rotta, Gabor Drescher, Daniel Danner, Jörg Nolte ACKNOWLEDGED EVENT PROPAGATION What does it do?

More information

CS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it

CS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it Lab 1 Starts Today Already posted on Canvas (under Assignment) Let s look at it CS 590: High Performance Computing Parallel Computer Architectures Fengguang Song Department of Computer Science IUPUI 1

More information

HydraFS: a High-Throughput File System for the HYDRAstor Content-Addressable Storage System

HydraFS: a High-Throughput File System for the HYDRAstor Content-Addressable Storage System HydraFS: a High-Throughput File System for the HYDRAstor Content-Addressable Storage System Cristian Ungureanu, Benjamin Atkin, Akshat Aranya, Salil Gokhale, Steve Rago, Grzegorz Calkowski, Cezary Dubnicki,

More information

CUDA GPGPU Workshop 2012

CUDA GPGPU Workshop 2012 CUDA GPGPU Workshop 2012 Parallel Programming: C thread, Open MP, and Open MPI Presenter: Nasrin Sultana Wichita State University 07/10/2012 Parallel Programming: Open MP, MPI, Open MPI & CUDA Outline

More information

Applications of Paxos Algorithm

Applications of Paxos Algorithm Applications of Paxos Algorithm Gurkan Solmaz COP 6938 - Cloud Computing - Fall 2012 Department of Electrical Engineering and Computer Science University of Central Florida - Orlando, FL Oct 15, 2012 1

More information

STAR: Scaling Transactions through Asymmetric Replication

STAR: Scaling Transactions through Asymmetric Replication STAR: Scaling Transactions through Asymmetric Replication arxiv:1811.259v2 [cs.db] 2 Feb 219 ABSTRACT Yi Lu MIT CSAIL yilu@csail.mit.edu In this paper, we present STAR, a new distributed in-memory database

More information

MINIMIZING TRANSACTION LATENCY IN GEO-REPLICATED DATA STORES

MINIMIZING TRANSACTION LATENCY IN GEO-REPLICATED DATA STORES MINIMIZING TRANSACTION LATENCY IN GEO-REPLICATED DATA STORES Divy Agrawal Department of Computer Science University of California at Santa Barbara Joint work with: Amr El Abbadi, Hatem Mahmoud, Faisal

More information

Deconstructing RDMA-enabled Distributed Transaction Processing: Hybrid is Better!

Deconstructing RDMA-enabled Distributed Transaction Processing: Hybrid is Better! Deconstructing RDMA-enabled Distributed Transaction Processing: Hybrid is Better! Xingda Wei, Zhiyuan Dong, Rong Chen, Haibo Chen Institute of Parallel and Distributed Systems (IPADS) Shanghai Jiao Tong

More information

Chapter 4: Multithreaded Programming

Chapter 4: Multithreaded Programming Chapter 4: Multithreaded Programming Silberschatz, Galvin and Gagne 2013 Chapter 4: Multithreaded Programming Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading

More information

Non-uniform memory access (NUMA)

Non-uniform memory access (NUMA) Non-uniform memory access (NUMA) Memory access between processor core to main memory is not uniform. Memory resides in separate regions called NUMA domains. For highest performance, cores should only access

More information

<Insert Picture Here>

<Insert Picture Here> The Other HPC: Profiling Enterprise-scale Applications Marty Itzkowitz Senior Principal SW Engineer, Oracle marty.itzkowitz@oracle.com Agenda HPC Applications

More information

Operating Systems 2 nd semester 2016/2017. Chapter 4: Threads

Operating Systems 2 nd semester 2016/2017. Chapter 4: Threads Operating Systems 2 nd semester 2016/2017 Chapter 4: Threads Mohamed B. Abubaker Palestine Technical College Deir El-Balah Note: Adapted from the resources of textbox Operating System Concepts, 9 th edition

More information

Typed Assembly Language for Implementing OS Kernels in SMP/Multi-Core Environments with Interrupts

Typed Assembly Language for Implementing OS Kernels in SMP/Multi-Core Environments with Interrupts Typed Assembly Language for Implementing OS Kernels in SMP/Multi-Core Environments with Interrupts Toshiyuki Maeda and Akinori Yonezawa University of Tokyo Quiz [Environment] CPU: Intel Xeon X5570 (2.93GHz)

More information