A Quest for Predictable Latency Adventures in Java Concurrency. Martin Thompson
|
|
- Warren Rodgers
- 6 years ago
- Views:
Transcription
1 A Quest for Predictable Latency Adventures in Java Concurrency Martin Thompson
2
3 If a system does not respond in a timely manner then it is effectively unavailable
4 1. It s all about the Blocking 2. What do we mean by Latency 3. Adventures with Locks and queues 4. Some Alternative FIFOs 5. Where can we go Next
5 1. It s all about the Blocking
6 What is preventing progress?
7 A thread is blocked when it cannot make progress
8 There are two major causes of blocking
9 Blocking Systemic Pauses JVM Safepoints (GC, etc.) Transparent Huge Pages (Linux) Hardware (C-States, SMIs)
10 Blocking Systemic Pauses JVM Safepoints (GC, etc.) Transparent Huge Pages (Linux) Hardware (C-States, SMIs) Concurrent Algorithms Notifying Completion Mutual Exclusion (Contention) Synchronisation / Rendezvous
11 Concurrent Algorithms Locks Atomic/CAS Instructions Call into kernel on contention Always blocking Difficult to get right Execute in user space Can be non-blocking Very difficult to get right
12 2. What do we mean by Latency
13 Queuing Theory
14 Queuing Theory Service Time
15 Queuing Theory Latent/Wait Time Service Time
16 Queuing Theory Latent/Wait Time Service Time Response Time
17 Queuing Theory Latent/Wait Time Service Time Enqueue Dequeue Response Time
18 Response Time Queuing Theory Utilisation
19 Speedup Amdahl s & Universal Scalability Laws Processors Amdahl USL
20 3. Adventures with Locks and Queues?
21 Some queue implementations spend more time queueing to be enqueued than in queue time!!!
22 The evils of Blocking
23 Queue.put() && Queue.take()
24 Condition Variables
25 Condition Variable RTT Signal Ping Echo Service Histogram Record Signal Pong
26 µs
27 µs Max = µs
28 Bad News That s the best case scenario!
29 Are non-blocking APIs any better?
30 Queue.offer() && Queue.poll()
31 Context for the Benchmarks Java 8u60 Ubuntu performance mode Intel i7-3632qm 2.2 GHz (Ivy Bridge)
32 Burst Length = 1: RTT (ns) Prod # Mean 99% Baseline (Wait-free) ArrayBlockingQueue , ,984 11, ,257 19,680 LinkedBlockingQueue ,197 7, ,010 15,152 ConcurrentLinkedQueue
33 Burst Length = 100: RTT (ns) Prod # Mean 99% Baseline (Wait-free) ArrayBlockingQueue 1 30,346 41, ,631 65, ,271 90,240 LinkedBlockingQueue 1 31,422 41, ,096 94, , ,224 ConcurrentLinkedQueue 1 12,916 15, ,132 35, ,462 56,768
34 Burst Length = 100: RTT (ns) Prod # Mean 99% Baseline (Wait-free) ArrayBlockingQueue 1 30,346 41, ,631 65, ,271 90,240 LinkedBlockingQueue 1 31,422 41, ,096 94, , ,224 ConcurrentLinkedQueue 1 12,916 15, ,132 35, ,462 56,768
35 Backpressure? Size methods? Flow Rates? Garbage? Fan out?
36 5. Some alternative FIFOs
37 Inter-Thread FIFOs
38 Disruptor claimsequence <<CAS>> 0 gating 0 cursor Reference Ring Buffer Influences: Lamport + Network Cards
39 Disruptor claimsequence <<CAS>> 0 gating 0 cursor Reference Ring Buffer Influences: Lamport + Network Cards
40 Disruptor claimsequence <<CAS>> 0 gating 0 cursor Reference Ring Buffer Influences: Lamport + Network Cards
41 Disruptor claimsequence <<CAS>> 0 gating 1 cursor Reference Ring Buffer Influences: Lamport + Network Cards
42 Disruptor claimsequence <<CAS>> 0 gating 1 cursor Reference Ring Buffer Influences: Lamport + Network Cards
43 Disruptor claimsequence <<CAS>> 1 gating 1 cursor Reference Ring Buffer Influences: Lamport + Network Cards
44 long expectedsequence = claimedsequence - 1; while (cursor!= expectedsequence) { // busy spin } cursor = claimedsequence;
45 Disruptor gatingcache 0 gating 0 cursor <<CAS>> available Reference Ring Buffer Influences: Lamport + Network Cards + Fast Flow
46 Disruptor gatingcache 0 gating 1 cursor <<CAS>> available Reference Ring Buffer Influences: Lamport + Network Cards + Fast Flow
47 Disruptor gatingcache 0 gating 1 cursor <<CAS>> available Reference Ring Buffer Influences: Lamport + Network Cards + Fast Flow
48 Disruptor gatingcache 0 gating 1 cursor <<CAS>> 1 available Reference Ring Buffer Influences: Lamport + Network Cards + Fast Flow
49 Disruptor gatingcache 0 gating 1 cursor <<CAS>> 1 available Reference Ring Buffer Influences: Lamport + Network Cards + Fast Flow
50 Disruptor gatingcache 1 gating 1 cursor <<CAS>> 1 available Reference Ring Buffer Influences: Lamport + Network Cards + Fast Flow
51 Disruptor 2.0
52 Disruptor 3.0
53 What if I need something with a Queue interface?
54 ManyToOneConcurrentArrayQueue 0 headcache 0 head 0 tail <<CAS>> Reference Ring Buffer Influences: Lamport + Fast Flow
55 ManyToOneConcurrentArrayQueue 0 headcache 0 head 1 tail <<CAS>> Reference Ring Buffer Influences: Lamport + Fast Flow
56 ManyToOneConcurrentArrayQueue 0 headcache 0 head 1 tail <<CAS>> Reference Ring Buffer Influences: Lamport + Fast Flow
57 ManyToOneConcurrentArrayQueue 0 headcache 0 head 1 tail <<CAS>> Reference Ring Buffer Influences: Lamport + Fast Flow
58 ManyToOneConcurrentArrayQueue 0 headcache 1 head 1 tail <<CAS>> Reference Ring Buffer Influences: Lamport + Fast Flow
59 Burst Length = 1: RTT (ns) Prod # Mean 99% Baseline (Wait-free) ConcurrentLinkedQueue Disruptor ManyToOneConcurrentArrayQueue
60 Burst Length = 1: RTT (ns) Prod # Mean 99% Baseline (Wait-free) ConcurrentLinkedQueue Disruptor ManyToOneConcurrentArrayQueue
61 Burst Length = 100: RTT (ns) Prod # Mean 99% Baseline (Wait-free) ConcurrentLinkedQueue 1 12,916 15, ,132 35, ,462 56,768 Disruptor ,030 8, ,159 21, ,597 54,976 ManyToOneConcurrentArrayQueue 1 3,570 3, ,926 16, ,685 51,072
62 Burst Length = 100: RTT (ns) Prod # Mean 99% Baseline (Wait-free) ConcurrentLinkedQueue 1 12,916 15, ,132 35, ,462 56,768 Disruptor ,030 8, ,159 21, ,597 54,976 ManyToOneConcurrentArrayQueue 1 3,570 3, ,926 16, ,685 51,072
63 BUT!!! Disruptor has single producer and batch methods
64 Inter-Process FIFOs
65 ManyToOneRingBuffer
66 File Head Message 2 Message 3 Tail <<CAS>>
67 File Head Message 2 Message 3 Tail <<CAS>>
68 File Head Message 2 Message 3 Message 4 Tail <<CAS>>
69 File Head Message 2 Message 3 Message 4 Tail <<CAS>>
70 File Head Message 3 Message 4 Tail <<CAS>>
71 File Head Message 3 Message 4 Tail <<CAS>>
72 MPSC Ring Buffer Message R Frame Length R Message Type Encoded Message
73 Takes a throughput hit on zero ing memory but latency is good
74 Aeron IPC
75 File Tail
76 File Message 1 Tail
77 File Message 1 Message 2 Tail
78 File Message 1 Message 2 Tail
79 File Message 1 Message 2 Message 3 Tail
80 File Message 1 Message 2 Message 3 Tail
81 One big file that goes on forever?
82 No!!! Page faults, page cache churn, VM pressure,...
83 Clean Dirty Message Message Active Message Message Message Message Tail Message Message
84 Do publishers need a CAS?
85 File Message 1 Message 2 Message 3 Message X Tail <<XADD>> Message Y
86 File Message 1 Message 2 Message 3 Message X Tail <<XADD>> Message Y
87 File Message 1 Message 2 Message 3 Message X Message Y Tail
88 File Message 1 Message 2 Message 3 Message X Message Y Tail
89 File Message 1 Message 2 Message 3 Message X Message Y Padding Tail
90 File Message 1 File Tail Message 2 Message 3 Message X Message Y Padding
91 File Message 1 File Message X Message 2 Tail Message 3 Message Y Padding
92 Have a background thread do zero ing and flow control
93 Burst Length = 1: RTT (ns) Prod # Mean 99% Baseline (Wait-free) ConcurrentLinkedQueue ManyToOneRingBuffer Aeron IPC
94 Burst Length = 1: RTT (ns) Prod # Mean 99% Baseline (Wait-free) ConcurrentLinkedQueue ManyToOneRingBuffer Aeron IPC
95 Aeron Data Message R Frame Length Version B E Flags Type R Term Offset Session ID Stream ID Term ID Encoded Message
96 Burst Length = 100: RTT (ns) Prod # Mean 99% Baseline (Wait-free) ConcurrentLinkedQueue 1 12,916 15, ,132 35, ,462 56,768 ManyToOneRingBuffer 1 3,773 3, ,286 13, ,255 34,432 Aeron IPC 1 4,436 4, ,623 8, ,825 13,872
97 Burst Length = 100: RTT (ns) Prod # Mean 99% Baseline (Wait-free) ConcurrentLinkedQueue 1 12,916 15, ,132 35, ,462 56,768 ManyToOneRingBuffer 1 3,773 3, ,286 13, ,255 34,432 Aeron IPC 1 4,436 4, ,623 8, ,825 13,872
98 Less False Sharing - Inlined data vs Reference Array Less Card Marking
99 Is avoiding the spinning CAS loop is a major step forward?
100 Burst Length = 100: RTT (ns) Prod # Mean 99% Baseline (Wait-free) ConcurrentLinkedQueue 1 12,916 15, ,132 35, ,462 56,768 ManyToOneRingBuffer 1 3,773 3, ,286 13, ,255 34,432 Aeron IPC 1 4,436 4, ,623 8, ,825 13,872
101 Burst Length = 100: 99% RTT (ns) 200, , , , , ,000 80,000 60,000 40,000 20,000 0
102 Logging is a Messaging Problem What design should we use?
103 5. Where can we go Next?
104 C vs Java
105 C vs Java Spin Wait Loops Thread.onSpinWait()
106 C vs Java Spin Wait Loops Thread.onSpinWait() Data Dependent Loads Heap Aggregates - ObjectLayout Stack Allocation Value Types
107 C vs Java Spin Wait Loops Thread.onSpinWait() Data Dependent Loads Heap Aggregates - ObjectLayout Stack Allocation Value Types Memory Copying Baseline against memcpy() for differing alignment
108 C vs Java Spin Wait Loops Thread.onSpinWait() Data Dependent Loads Heap Aggregates - ObjectLayout Stack Allocation Value Types Memory Copying Baseline against memcpy() for differing alignment Queue Interface Break conflated concerns to reduce blocking actions
109 In closing
110 2TB RAM and 100+ vcores!!!
111 Where can I find the code?
112 Questions? Blog: Any intelligent fool can make things bigger, more complex, and more violent. It takes a touch of genius, and a lot of courage, to move in the opposite direction. - Albert Einstein
High Performance Managed Languages. Martin Thompson
High Performance Managed Languages Martin Thompson - @mjpt777 Really, what s your preferred platform for building HFT applications? Why would you build low-latency applications on a GC ed platform? Some
More informationHigh Performance Managed Languages. Martin Thompson
High Performance Managed Languages Martin Thompson - @mjpt777 Really, what is your preferred platform for building HFT applications? Why do you build low-latency applications on a GC ed platform? Agenda
More informationDesigning for Performance. Martin Thompson
Designing for Performance Martin Thompson - @mjpt777 Feynman is becoming a real pain. He has the greatest scientific honesty of anyone I ve ever meet - William P Rogers The impact of QED cannot be overestimated.
More informationDesigning for Performance. Martin Thompson
Designing for Performance Martin Thompson - @mjpt777 Is it difficult writing software that has good performance? RDD (Resume Driven Development) http://www.semiconductors.org/main/2015_international_technology_roadmap_for_semiconductors_itrs/
More informationRectangles All The Way Down. Martin Thompson
Rectangles All The Way Down Martin Thompson - @mjpt777 The most amazing achievement of the computer software industry is its continuing cancellation of the steady and staggering gains made by the computer
More informationOS-caused Long JVM Pauses - Deep Dive and Solutions
OS-caused Long JVM Pauses - Deep Dive and Solutions Zhenyun Zhuang LinkedIn Corp., Mountain View, California, USA https://www.linkedin.com/in/zhenyun Zhenyun@gmail.com 2016-4-21 Outline q Introduction
More informationDisruptor Using High Performance, Low Latency Technology in the CERN Control System
Disruptor Using High Performance, Low Latency Technology in the CERN Control System ICALEPCS 2015 21/10/2015 2 The problem at hand 21/10/2015 WEB3O03 3 The problem at hand CESAR is used to control the
More informationPERFORMANCE ANALYSIS AND OPTIMIZATION OF SKIP LISTS FOR MODERN MULTI-CORE ARCHITECTURES
PERFORMANCE ANALYSIS AND OPTIMIZATION OF SKIP LISTS FOR MODERN MULTI-CORE ARCHITECTURES Anish Athalye and Patrick Long Mentors: Austin Clements and Stephen Tu 3 rd annual MIT PRIMES Conference Sequential
More informationRWLocks in Erlang/OTP
RWLocks in Erlang/OTP Rickard Green rickard@erlang.org What is an RWLock? Read/Write Lock Write locked Exclusive access for one thread Read locked Multiple reader threads No writer threads RWLocks Used
More informationCascade Mapping: Optimizing Memory Efficiency for Flash-based Key-value Caching
Cascade Mapping: Optimizing Memory Efficiency for Flash-based Key-value Caching Kefei Wang and Feng Chen Louisiana State University SoCC '18 Carlsbad, CA Key-value Systems in Internet Services Key-value
More informationConcurrent programming: From theory to practice. Concurrent Algorithms 2015 Vasileios Trigonakis
oncurrent programming: From theory to practice oncurrent Algorithms 2015 Vasileios Trigonakis From theory to practice Theoretical (design) Practical (design) Practical (implementation) 2 From theory to
More informationCS162 Operating Systems and Systems Programming Lecture 11 Page Allocation and Replacement"
CS162 Operating Systems and Systems Programming Lecture 11 Page Allocation and Replacement" October 3, 2012 Ion Stoica http://inst.eecs.berkeley.edu/~cs162 Lecture 9 Followup: Inverted Page Table" With
More informationPracticing at the Cutting Edge Learning and Unlearning about Performance. Martin Thompson
Practicing at the Cutting Edge Learning and Unlearning about Performance Martin Thompson - @mjpt777 Learning and Unlearning 1. Brief History of Java 2. Evolving Design approach 3. An evolving Hardware
More informationWorkload Characterization and Optimization of TPC-H Queries on Apache Spark
Workload Characterization and Optimization of TPC-H Queries on Apache Spark Tatsuhiro Chiba and Tamiya Onodera IBM Research - Tokyo April. 17-19, 216 IEEE ISPASS 216 @ Uppsala, Sweden Overview IBM Research
More informationJava Performance: The Definitive Guide
Java Performance: The Definitive Guide Scott Oaks Beijing Cambridge Farnham Kbln Sebastopol Tokyo O'REILLY Table of Contents Preface ix 1. Introduction 1 A Brief Outline 2 Platforms and Conventions 2 JVM
More informationThe benefits and costs of writing a POSIX kernel in a high-level language
1 / 38 The benefits and costs of writing a POSIX kernel in a high-level language Cody Cutler, M. Frans Kaashoek, Robert T. Morris MIT CSAIL Should we use high-level languages to build OS kernels? 2 / 38
More informationLow Latency Java in the Real World
Low Latency Java in the Real World LMAX Exchange and the Zing JVM Mark Price, Senior Developer, LMAX Exchange Gil Tene, CTO & co-founder, Azul Systems Low Latency in the Java Real World LMAX Exchange and
More informationMartin Kruliš, v
Martin Kruliš 1 Optimizations in General Code And Compilation Memory Considerations Parallelism Profiling And Optimization Examples 2 Premature optimization is the root of all evil. -- D. Knuth Our goal
More informationTowards Parallel, Scalable VM Services
Towards Parallel, Scalable VM Services Kathryn S McKinley The University of Texas at Austin Kathryn McKinley Towards Parallel, Scalable VM Services 1 20 th Century Simplistic Hardware View Faster Processors
More informationJava Without the Jitter
TECHNOLOGY WHITE PAPER Achieving Ultra-Low Latency Table of Contents Executive Summary... 3 Introduction... 4 Why Java Pauses Can t Be Tuned Away.... 5 Modern Servers Have Huge Capacities Why Hasn t Latency
More informationVIRTUAL MEMORY READING: CHAPTER 9
VIRTUAL MEMORY READING: CHAPTER 9 9 MEMORY HIERARCHY Core! Processor! Core! Caching! Main! Memory! (DRAM)!! Caching!! Secondary Storage (SSD)!!!! Secondary Storage (Disk)! L cache exclusive to a single
More informationJAVA BYTE IPC: PART 1 DERIVING NIO FROM I/O. Instructor: Prasun Dewan (FB 150,
JAVA BYTE IPC: PART 1 DERIVING NIO FROM I/O Instructor: Prasun Dewan (FB 150, dewan@unc.edu) ROX TUTORIAL 2 ROX ECHO APPLICATION Client Server Client Client 3 ASSIGNMENT REQUIREMENTS Client Server Client
More informationOracle Database 12c: JMS Sharded Queues
Oracle Database 12c: JMS Sharded Queues For high performance, scalable Advanced Queuing ORACLE WHITE PAPER MARCH 2015 Table of Contents Introduction 2 Architecture 3 PERFORMANCE OF AQ-JMS QUEUES 4 PERFORMANCE
More informationKVM PERFORMANCE OPTIMIZATIONS INTERNALS. Rik van Riel Sr Software Engineer, Red Hat Inc. Thu May
KVM PERFORMANCE OPTIMIZATIONS INTERNALS Rik van Riel Sr Software Engineer, Red Hat Inc. Thu May 5 2011 KVM performance optimizations What is virtualization performance? Optimizations in RHEL 6.0 Selected
More informationPage 1. Goals for Today" TLB organization" CS162 Operating Systems and Systems Programming Lecture 11. Page Allocation and Replacement"
Goals for Today" CS162 Operating Systems and Systems Programming Lecture 11 Page Allocation and Replacement" Finish discussion on TLBs! Page Replacement Policies! FIFO, LRU! Clock Algorithm!! Working Set/Thrashing!
More informationComputer Systems A Programmer s Perspective 1 (Beta Draft)
Computer Systems A Programmer s Perspective 1 (Beta Draft) Randal E. Bryant David R. O Hallaron August 1, 2001 1 Copyright c 2001, R. E. Bryant, D. R. O Hallaron. All rights reserved. 2 Contents Preface
More informationKodewerk. Java Performance Services. The War on Latency. Reducing Dead Time Kirk Pepperdine Principle Kodewerk Ltd.
Kodewerk tm Java Performance Services The War on Latency Reducing Dead Time Kirk Pepperdine Principle Kodewerk Ltd. Me Work as a performance tuning freelancer Nominated Sun Java Champion www.kodewerk.com
More informationFast and Scalable Queue-Based Resource Allocation Lock on Shared-Memory Multiprocessors
Fast and Scalable Queue-Based Resource Allocation Lock on Shared-Memory Multiprocessors Deli Zhang, Brendan Lynch, and Damian Dechev University of Central Florida, Orlando, USA April 27, 2016 Mutual Exclusion
More informationdavidklee.net gplus.to/kleegeek linked.com/a/davidaklee
@kleegeek davidklee.net gplus.to/kleegeek linked.com/a/davidaklee Specialties / Focus Areas / Passions: Performance Tuning & Troubleshooting Virtualization Cloud Enablement Infrastructure Architecture
More informationMultilevel Memories. Joel Emer Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology
1 Multilevel Memories Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology Based on the material prepared by Krste Asanovic and Arvind CPU-Memory Bottleneck 6.823
More informationFast and Scalable Queue-Based Resource Allocation Lock on Shared-Memory Multiprocessors
Background Fast and Scalable Queue-Based Resource Allocation Lock on Shared-Memory Multiprocessors Deli Zhang, Brendan Lynch, and Damian Dechev University of Central Florida, Orlando, USA December 18,
More informationDistributed Systems Exam 1 Review Paul Krzyzanowski. Rutgers University. Fall 2016
Distributed Systems 2015 Exam 1 Review Paul Krzyzanowski Rutgers University Fall 2016 1 Question 1 Why did the use of reference counting for remote objects prove to be impractical? Explain. It s not fault
More informationScalable Streaming Analytics
Scalable Streaming Analytics KARTHIK RAMASAMY @karthikz TALK OUTLINE BEGIN I! II ( III b Overview Storm Overview Storm Internals IV Z V K Heron Operational Experiences END WHAT IS ANALYTICS? according
More informationJCTools: Pit Of Performance Nitsan Copyright - TTNR Labs Limited 2018
JCTools: Pit Of Performance Nitsan Wakart/@nitsanw 1 You Look Familiar... 2 Why me? Main JCTools contributor Performance Eng. Find me on: GitHub : nitsanw Blog : psy-lob-saw.blogspot.com Twitter : nitsanw
More informationMultiprocessor Systems. Chapter 8, 8.1
Multiprocessor Systems Chapter 8, 8.1 1 Learning Outcomes An understanding of the structure and limits of multiprocessor hardware. An appreciation of approaches to operating system support for multiprocessor
More informationFalcon: Scaling IO Performance in Multi-SSD Volumes. The George Washington University
Falcon: Scaling IO Performance in Multi-SSD Volumes Pradeep Kumar H Howie Huang The George Washington University SSDs in Big Data Applications Recent trends advocate using many SSDs for higher throughput
More informationMemory Management Virtual Memory
Memory Management Virtual Memory Part of A3 course (by Theo Schouten) Biniam Gebremichael http://www.cs.ru.nl/~biniam/ Office: A6004 April 4 2005 Content Virtual memory Definition Advantage and challenges
More informationIntroduction. CS3026 Operating Systems Lecture 01
Introduction CS3026 Operating Systems Lecture 01 One or more CPUs Device controllers (I/O modules) Memory Bus Operating system? Computer System What is an Operating System An Operating System is a program
More informationBottleneck Identification and Scheduling in Multithreaded Applications. José A. Joao M. Aater Suleman Onur Mutlu Yale N. Patt
Bottleneck Identification and Scheduling in Multithreaded Applications José A. Joao M. Aater Suleman Onur Mutlu Yale N. Patt Executive Summary Problem: Performance and scalability of multithreaded applications
More informationCluster Consensus When Aeron Met Raft. Martin Thompson
Cluster When Aeron Met Raft Martin Thompson - @mjpt777 What does mean? con sen sus noun \ kən-ˈsen(t)-səs \ : general agreement : unanimity Source: http://www.merriam-webster.com/ con sen sus noun \ kən-ˈsen(t)-səs
More informationTwitter Heron: Stream Processing at Scale
Twitter Heron: Stream Processing at Scale Saiyam Kohli December 8th, 2016 CIS 611 Research Paper Presentation -Sun Sunnie Chung TWITTER IS A REAL TIME ABSTRACT We process billions of events on Twitter
More informationModern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design
Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant
More informationDistributed caching for cloud computing
Distributed caching for cloud computing Maxime Lorrillere, Julien Sopena, Sébastien Monnet et Pierre Sens February 11, 2013 Maxime Lorrillere (LIP6/UPMC/CNRS) February 11, 2013 1 / 16 Introduction Context
More informationUser-level Management of Kernel Memory
User-level Management of Memory Andreas Haeberlen University of Karlsruhe Karlsruhe, Germany Kevin Elphinstone University of New South Wales Sydney, Australia 1 Motivation: memory Threads Files memory
More informationAn efficient Unbounded Lock-Free Queue for Multi-Core Systems
An efficient Unbounded Lock-Free Queue for Multi-Core Systems Authors: Marco Aldinucci 1, Marco Danelutto 2, Peter Kilpatrick 3, Massimiliano Meneghin 4 and Massimo Torquati 2 1 Computer Science Dept.
More information6.033 Computer System Engineering
MIT OpenCourseWare http://ocw.mit.edu 6.033 Computer System Engineering Spring 2009 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms. 6.033 2009 Lecture
More informationCache-Aware Lock-Free Queues for Multiple Producers/Consumers and Weak Memory Consistency
Cache-Aware Lock-Free Queues for Multiple Producers/Consumers and Weak Memory Consistency Anders Gidenstam Håkan Sundell Philippas Tsigas School of business and informatics University of Borås Distributed
More informationIME Infinite Memory Engine Technical Overview
1 1 IME Infinite Memory Engine Technical Overview 2 Bandwidth, IOPs single NVMe drive 3 What does Flash mean for Storage? It's a new fundamental device for storing bits. We must treat it different from
More informationAgenda Process Concept Process Scheduling Operations on Processes Interprocess Communication 3.2
Lecture 3: Processes Agenda Process Concept Process Scheduling Operations on Processes Interprocess Communication 3.2 Process in General 3.3 Process Concept Process is an active program in execution; process
More informationReview: Creating a Parallel Program. Programming for Performance
Review: Creating a Parallel Program Can be done by programmer, compiler, run-time system or OS Steps for creating parallel program Decomposition Assignment of tasks to processes Orchestration Mapping (C)
More informationCS370 Operating Systems
CS370 Operating Systems Colorado State University Yashwant K Malaiya Fall 2017 Lecture 21 Main Memory Slides based on Text by Silberschatz, Galvin, Gagne Various sources 1 1 FAQ Why not increase page size
More informationTales of the Tail Hardware, OS, and Application-level Sources of Tail Latency
Tales of the Tail Hardware, OS, and Application-level Sources of Tail Latency Jialin Li, Naveen Kr. Sharma, Dan R. K. Ports and Steven D. Gribble February 2, 2015 1 Introduction What is Tail Latency? What
More informationLecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU , Spring 2013
Lecture 10: Cache Coherence: Part I Parallel Computer Architecture and Programming Cache design review Let s say your code executes int x = 1; (Assume for simplicity x corresponds to the address 0x12345604
More informationSolution for Operating System
Solution for Operating System May 2016 Index Q.1) a). 2 b). 3 c).3-5 d).5-7 Q.2) a). 7-13 b). 13-14 Q.3) a). 15-17 b). 18-19 Q.4) a). N.A b). N.A Q.5) a). 19-25 b). N.A Q.6) a). 26 b). 27-28 c). N.A d).
More informationA Scalable Lock Manager for Multicores
A Scalable Lock Manager for Multicores Hyungsoo Jung Hyuck Han Alan Fekete NICTA Samsung Electronics University of Sydney Gernot Heiser NICTA Heon Y. Yeom Seoul National University @University of Sydney
More informationCPS 310 second midterm exam, 11/14/2014
CPS 310 second midterm exam, 11/14/2014 Your name please: Part 1. Sticking points Consider the Java code snippet below. Is it a legal use of Java synchronization? What happens if two threads A and B call
More informationMemory Management. To improve CPU utilization in a multiprogramming environment we need multiple programs in main memory at the same time.
Memory Management To improve CPU utilization in a multiprogramming environment we need multiple programs in main memory at the same time. Basic CPUs and Physical Memory CPU cache Physical memory
More informationJava performance - not so scary after all
Java performance - not so scary after all Holly Cummins IBM Hursley Labs 2009 IBM Corporation 2001 About me Joined IBM Began professional life writing event framework for WebSphere 2004 Moved to work on
More informationResearch on the Implementation of MPI on Multicore Architectures
Research on the Implementation of MPI on Multicore Architectures Pengqi Cheng Department of Computer Science & Technology, Tshinghua University, Beijing, China chengpq@gmail.com Yan Gu Department of Computer
More informationThe Z Garbage Collector Scalable Low-Latency GC in JDK 11
The Z Garbage Collector Scalable Low-Latency GC in JDK 11 Per Lidén (@perliden) Consulting Member of Technical Staff Java Platform Group, Oracle October 24, 2018 Safe Harbor Statement The following is
More informationDEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING UNIT I
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING Year and Semester : II / IV Subject Code : CS6401 Subject Name : Operating System Degree and Branch : B.E CSE UNIT I 1. Define system process 2. What is an
More informationCSL373: Lecture 5 Deadlocks (no process runnable) + Scheduling (> 1 process runnable)
CSL373: Lecture 5 Deadlocks (no process runnable) + Scheduling (> 1 process runnable) Past & Present Have looked at two constraints: Mutual exclusion constraint between two events is a requirement that
More informationCSE 153 Design of Operating Systems
CSE 153 Design of Operating Systems Winter 19 Lecture 7/8: Synchronization (1) Administrivia How is Lab going? Be prepared with questions for this weeks Lab My impression from TAs is that you are on track
More informationPointer Analysis in the Presence of Dynamic Class Loading. Hind Presented by Brian Russell
Pointer Analysis in the Presence of Dynamic Class Loading Martin Hirzel, Amer Diwan and Michael Hind Presented by Brian Russell Claim: First nontrivial pointer analysis dealing with all Java language features
More informationKeeping up with the hardware
Keeping up with the hardware Challenges in scaling I/O performance Jonathan Davies XenServer System Performance Lead XenServer Engineering, Citrix Cambridge, UK 18 Aug 2015 Jonathan Davies (Citrix) Keeping
More informationScaling the OpenJDK. Claes Redestad Java SE Performance Team Oracle. Copyright 2017, Oracle and/or its afliates. All rights reserved.
Scaling the OpenJDK Claes Redestad Java SE Performance Team Oracle Safe Harbor Statement The following is intended to outline our general product direction. It is intended for information purposes only,
More informationReal Time: Understanding the Trade-offs Between Determinism and Throughput
Real Time: Understanding the Trade-offs Between Determinism and Throughput Roland Westrelin, Java Real-Time Engineering, Brian Doherty, Java Performance Engineering, Sun Microsystems, Inc TS-5609 Learn
More informationModern C++ Old Dog, New Tricks Todd L.
StoneTor Modern C++ Old Dog, New Tricks Todd L. Montgomery @toddlmontgomery C++ is so old Languages are Tools Learning Tools is Good There are only two kinds of languages: the ones people complain about
More informationInitial Evaluation of a User-Level Device Driver Framework
Initial Evaluation of a User-Level Device Driver Framework Stefan Götz Karlsruhe University Germany sgoetz@ira.uka.de Kevin Elphinstone National ICT Australia University of New South Wales kevine@cse.unsw.edu.au
More informationA common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads...
OPENMP PERFORMANCE 2 A common scenario... So I wrote my OpenMP program, and I checked it gave the right answers, so I ran some timing tests, and the speedup was, well, a bit disappointing really. Now what?.
More informationMultiJav: A Distributed Shared Memory System Based on Multiple Java Virtual Machines. MultiJav: Introduction
: A Distributed Shared Memory System Based on Multiple Java Virtual Machines X. Chen and V.H. Allan Computer Science Department, Utah State University 1998 : Introduction Built on concurrency supported
More informationMuch Faster Networking
Much Faster Networking David Riddoch driddoch@solarflare.com Copyright 2016 Solarflare Communications, Inc. All rights reserved. What is kernel bypass? The standard receive path The standard receive path
More information3/3/2014! Anthony D. Joseph!!CS162! UCB Spring 2014!
Post Project 1 Class Format" CS162 Operating Systems and Systems Programming Lecture 11 Page Allocation and Replacement" Mini quizzes after each topic Not graded Simple True/False Immediate feedback for
More informationSynchronization COMPSCI 386
Synchronization COMPSCI 386 Obvious? // push an item onto the stack while (top == SIZE) ; stack[top++] = item; // pop an item off the stack while (top == 0) ; item = stack[top--]; PRODUCER CONSUMER Suppose
More informationQUESTION BANK UNIT I
QUESTION BANK Subject Name: Operating Systems UNIT I 1) Differentiate between tightly coupled systems and loosely coupled systems. 2) Define OS 3) What are the differences between Batch OS and Multiprogramming?
More informationwait with priority An enhanced version of the wait operation accepts an optional priority argument:
wait with priority An enhanced version of the wait operation accepts an optional priority argument: syntax: .wait the smaller the value of the parameter, the highest the priority
More informationImprove Linux User-Space Core Libraries with Restartable Sequences
Open Source Summit 2018 Improve Linux User-Space Core Libraries with Restartable Sequences mathieu.desnoyers@efcios.com Speaker Mathieu Desnoyers CEO at EfficiOS Inc. Maintainer of: LTTng kernel and user-space
More information5 Solutions. Solution a. no solution provided. b. no solution provided
5 Solutions Solution 5.1 5.1.1 5.1.2 5.1.3 5.1.4 5.1.5 5.1.6 S2 Chapter 5 Solutions Solution 5.2 5.2.1 4 5.2.2 a. I, J b. B[I][0] 5.2.3 a. A[I][J] b. A[J][I] 5.2.4 a. 3596 = 8 800/4 2 8 8/4 + 8000/4 b.
More informationJava Performance Tuning From A Garbage Collection Perspective. Nagendra Nagarajayya MDE
Java Performance Tuning From A Garbage Collection Perspective Nagendra Nagarajayya MDE Agenda Introduction To Garbage Collection Performance Problems Due To Garbage Collection Performance Tuning Manual
More informationReducing CPU and network overhead for small I/O requests in network storage protocols over raw Ethernet
Reducing CPU and network overhead for small I/O requests in network storage protocols over raw Ethernet Pilar González-Férez and Angelos Bilas 31 th International Conference on Massive Storage Systems
More informationConcurrent Garbage Collection
Concurrent Garbage Collection Deepak Sreedhar JVM engineer, Azul Systems Java User Group Bangalore 1 @azulsystems azulsystems.com About me: Deepak Sreedhar JVM student at Azul Systems Currently working
More informationCA485 Ray Walshe Google File System
Google File System Overview Google File System is scalable, distributed file system on inexpensive commodity hardware that provides: Fault Tolerance File system runs on hundreds or thousands of storage
More informationArrakis: The Operating System is the Control Plane
Arrakis: The Operating System is the Control Plane Simon Peter, Jialin Li, Irene Zhang, Dan Ports, Doug Woos, Arvind Krishnamurthy, Tom Anderson University of Washington Timothy Roscoe ETH Zurich Building
More informationA common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads...
OPENMP PERFORMANCE 2 A common scenario... So I wrote my OpenMP program, and I checked it gave the right answers, so I ran some timing tests, and the speedup was, well, a bit disappointing really. Now what?.
More informationVirtual memory. Virtual memory - Swapping. Large virtual memory. Processes
Virtual memory Virtual memory - Swapping Johan Montelius KTH 2017 1: Allowing two or more processes to use main memory, given them an illusion of private memory. 2: Provide the illusion of a much larger
More informationCEC 450 Real-Time Systems
CEC 450 Real-Time Systems Lecture 6 Accounting for I/O Latency September 28, 2015 Sam Siewert A Service Release and Response C i WCET Input/Output Latency Interference Time Response Time = Time Actuation
More informationCS 241 Honors Concurrent Data Structures
CS 241 Honors Concurrent Data Structures Bhuvan Venkatesh University of Illinois Urbana Champaign March 27, 2018 CS 241 Course Staff (UIUC) Lock Free Data Structures March 27, 2018 1 / 43 What to go over
More informationCPS 110 Midterm. Spring 2011
CPS 110 Midterm Spring 2011 Ola! Greetings from Puerto Rico, where the air is warm and salty and the mojitos are cold and sweet. Please answer all questions for a total of 200 points. Keep it clear and
More informationCIS 21 Final Study Guide. Final covers ch. 1-20, except for 17. Need to know:
CIS 21 Final Study Guide Final covers ch. 1-20, except for 17. Need to know: I. Amdahl's Law II. Moore s Law III. Processes and Threading A. What is a process? B. What is a thread? C. Modes (kernel mode,
More informationCaching and reliability
Caching and reliability Block cache Vs. Latency ~10 ns 1~ ms Access unit Byte (word) Sector Capacity Gigabytes Terabytes Price Expensive Cheap Caching disk contents in RAM Hit ratio h : probability of
More informationSeptember 15th, Finagle + Java. A love story (
September 15th, 2016 Finagle + Java A love story ( ) @mnnakamura hi, I m Moses Nakamura Twitter lives on the JVM When Twitter realized we couldn t stay on a Rails monolith and continue to scale at the
More informationJAVA PERFORMANCE. PR SW2 S18 Dr. Prähofer DI Leopoldseder
JAVA PERFORMANCE PR SW2 S18 Dr. Prähofer DI Leopoldseder OUTLINE 1. What is performance? 1. Benchmarking 2. What is Java performance? 1. Interpreter vs JIT 3. Tools to measure performance 4. Memory Performance
More informationCS510 Operating System Foundations. Jonathan Walpole
CS510 Operating System Foundations Jonathan Walpole File System Performance File System Performance Memory mapped files - Avoid system call overhead Buffer cache - Avoid disk I/O overhead Careful data
More informationMultiprocessors II: CC-NUMA DSM. CC-NUMA for Large Systems
Multiprocessors II: CC-NUMA DSM DSM cache coherence the hardware stuff Today s topics: what happens when we lose snooping new issues: global vs. local cache line state enter the directory issues of increasing
More informationJava Concurrency. Towards a better life By - -
Java Concurrency Towards a better life By - Srinivasan.raghavan@oracle.com - Vaibhav.x.choudhary@oracle.com Java Releases J2SE 6: - Collection Framework enhancement -Drag and Drop -Improve IO support J2SE
More informationConcurrent Data Structures Concurrent Algorithms 2016
Concurrent Data Structures Concurrent Algorithms 2016 Tudor David (based on slides by Vasileios Trigonakis) Tudor David 11.2016 1 Data Structures (DSs) Constructs for efficiently storing and retrieving
More informationThread-Local. Lecture 27: Concurrency 3. Dealing with the Rest. Immutable. Whenever possible, don t share resources
Thread-Local Lecture 27: Concurrency 3 CS 62 Fall 2016 Kim Bruce & Peter Mawhorter Some slides based on those from Dan Grossman, U. of Washington Whenever possible, don t share resources Easier to have
More informationCSE 486/586 Distributed Systems
CSE 486/586 Distributed Systems Mutual Exclusion Steve Ko Computer Sciences and Engineering University at Buffalo CSE 486/586 Recap: Consensus On a synchronous system There s an algorithm that works. On
More informationIME (Infinite Memory Engine) Extreme Application Acceleration & Highly Efficient I/O Provisioning
IME (Infinite Memory Engine) Extreme Application Acceleration & Highly Efficient I/O Provisioning September 22 nd 2015 Tommaso Cecchi 2 What is IME? This breakthrough, software defined storage application
More informationCOP 4610: Introduction to Operating Systems (Spring 2016) Chapter 3: Process. Zhi Wang Florida State University
COP 4610: Introduction to Operating Systems (Spring 2016) Chapter 3: Process Zhi Wang Florida State University Contents Process concept Process scheduling Operations on processes Inter-process communication
More information