A Quest for Predictable Latency Adventures in Java Concurrency. Martin Thompson

Size: px
Start display at page:

Download "A Quest for Predictable Latency Adventures in Java Concurrency. Martin Thompson"

Transcription

1 A Quest for Predictable Latency Adventures in Java Concurrency Martin Thompson

2

3 If a system does not respond in a timely manner then it is effectively unavailable

4 1. It s all about the Blocking 2. What do we mean by Latency 3. Adventures with Locks and queues 4. Some Alternative FIFOs 5. Where can we go Next

5 1. It s all about the Blocking

6 What is preventing progress?

7 A thread is blocked when it cannot make progress

8 There are two major causes of blocking

9 Blocking Systemic Pauses JVM Safepoints (GC, etc.) Transparent Huge Pages (Linux) Hardware (C-States, SMIs)

10 Blocking Systemic Pauses JVM Safepoints (GC, etc.) Transparent Huge Pages (Linux) Hardware (C-States, SMIs) Concurrent Algorithms Notifying Completion Mutual Exclusion (Contention) Synchronisation / Rendezvous

11 Concurrent Algorithms Locks Atomic/CAS Instructions Call into kernel on contention Always blocking Difficult to get right Execute in user space Can be non-blocking Very difficult to get right

12 2. What do we mean by Latency

13 Queuing Theory

14 Queuing Theory Service Time

15 Queuing Theory Latent/Wait Time Service Time

16 Queuing Theory Latent/Wait Time Service Time Response Time

17 Queuing Theory Latent/Wait Time Service Time Enqueue Dequeue Response Time

18 Response Time Queuing Theory Utilisation

19 Speedup Amdahl s & Universal Scalability Laws Processors Amdahl USL

20 3. Adventures with Locks and Queues?

21 Some queue implementations spend more time queueing to be enqueued than in queue time!!!

22 The evils of Blocking

23 Queue.put() && Queue.take()

24 Condition Variables

25 Condition Variable RTT Signal Ping Echo Service Histogram Record Signal Pong

26 µs

27 µs Max = µs

28 Bad News That s the best case scenario!

29 Are non-blocking APIs any better?

30 Queue.offer() && Queue.poll()

31 Context for the Benchmarks Java 8u60 Ubuntu performance mode Intel i7-3632qm 2.2 GHz (Ivy Bridge)

32 Burst Length = 1: RTT (ns) Prod # Mean 99% Baseline (Wait-free) ArrayBlockingQueue , ,984 11, ,257 19,680 LinkedBlockingQueue ,197 7, ,010 15,152 ConcurrentLinkedQueue

33 Burst Length = 100: RTT (ns) Prod # Mean 99% Baseline (Wait-free) ArrayBlockingQueue 1 30,346 41, ,631 65, ,271 90,240 LinkedBlockingQueue 1 31,422 41, ,096 94, , ,224 ConcurrentLinkedQueue 1 12,916 15, ,132 35, ,462 56,768

34 Burst Length = 100: RTT (ns) Prod # Mean 99% Baseline (Wait-free) ArrayBlockingQueue 1 30,346 41, ,631 65, ,271 90,240 LinkedBlockingQueue 1 31,422 41, ,096 94, , ,224 ConcurrentLinkedQueue 1 12,916 15, ,132 35, ,462 56,768

35 Backpressure? Size methods? Flow Rates? Garbage? Fan out?

36 5. Some alternative FIFOs

37 Inter-Thread FIFOs

38 Disruptor claimsequence <<CAS>> 0 gating 0 cursor Reference Ring Buffer Influences: Lamport + Network Cards

39 Disruptor claimsequence <<CAS>> 0 gating 0 cursor Reference Ring Buffer Influences: Lamport + Network Cards

40 Disruptor claimsequence <<CAS>> 0 gating 0 cursor Reference Ring Buffer Influences: Lamport + Network Cards

41 Disruptor claimsequence <<CAS>> 0 gating 1 cursor Reference Ring Buffer Influences: Lamport + Network Cards

42 Disruptor claimsequence <<CAS>> 0 gating 1 cursor Reference Ring Buffer Influences: Lamport + Network Cards

43 Disruptor claimsequence <<CAS>> 1 gating 1 cursor Reference Ring Buffer Influences: Lamport + Network Cards

44 long expectedsequence = claimedsequence - 1; while (cursor!= expectedsequence) { // busy spin } cursor = claimedsequence;

45 Disruptor gatingcache 0 gating 0 cursor <<CAS>> available Reference Ring Buffer Influences: Lamport + Network Cards + Fast Flow

46 Disruptor gatingcache 0 gating 1 cursor <<CAS>> available Reference Ring Buffer Influences: Lamport + Network Cards + Fast Flow

47 Disruptor gatingcache 0 gating 1 cursor <<CAS>> available Reference Ring Buffer Influences: Lamport + Network Cards + Fast Flow

48 Disruptor gatingcache 0 gating 1 cursor <<CAS>> 1 available Reference Ring Buffer Influences: Lamport + Network Cards + Fast Flow

49 Disruptor gatingcache 0 gating 1 cursor <<CAS>> 1 available Reference Ring Buffer Influences: Lamport + Network Cards + Fast Flow

50 Disruptor gatingcache 1 gating 1 cursor <<CAS>> 1 available Reference Ring Buffer Influences: Lamport + Network Cards + Fast Flow

51 Disruptor 2.0

52 Disruptor 3.0

53 What if I need something with a Queue interface?

54 ManyToOneConcurrentArrayQueue 0 headcache 0 head 0 tail <<CAS>> Reference Ring Buffer Influences: Lamport + Fast Flow

55 ManyToOneConcurrentArrayQueue 0 headcache 0 head 1 tail <<CAS>> Reference Ring Buffer Influences: Lamport + Fast Flow

56 ManyToOneConcurrentArrayQueue 0 headcache 0 head 1 tail <<CAS>> Reference Ring Buffer Influences: Lamport + Fast Flow

57 ManyToOneConcurrentArrayQueue 0 headcache 0 head 1 tail <<CAS>> Reference Ring Buffer Influences: Lamport + Fast Flow

58 ManyToOneConcurrentArrayQueue 0 headcache 1 head 1 tail <<CAS>> Reference Ring Buffer Influences: Lamport + Fast Flow

59 Burst Length = 1: RTT (ns) Prod # Mean 99% Baseline (Wait-free) ConcurrentLinkedQueue Disruptor ManyToOneConcurrentArrayQueue

60 Burst Length = 1: RTT (ns) Prod # Mean 99% Baseline (Wait-free) ConcurrentLinkedQueue Disruptor ManyToOneConcurrentArrayQueue

61 Burst Length = 100: RTT (ns) Prod # Mean 99% Baseline (Wait-free) ConcurrentLinkedQueue 1 12,916 15, ,132 35, ,462 56,768 Disruptor ,030 8, ,159 21, ,597 54,976 ManyToOneConcurrentArrayQueue 1 3,570 3, ,926 16, ,685 51,072

62 Burst Length = 100: RTT (ns) Prod # Mean 99% Baseline (Wait-free) ConcurrentLinkedQueue 1 12,916 15, ,132 35, ,462 56,768 Disruptor ,030 8, ,159 21, ,597 54,976 ManyToOneConcurrentArrayQueue 1 3,570 3, ,926 16, ,685 51,072

63 BUT!!! Disruptor has single producer and batch methods

64 Inter-Process FIFOs

65 ManyToOneRingBuffer

66 File Head Message 2 Message 3 Tail <<CAS>>

67 File Head Message 2 Message 3 Tail <<CAS>>

68 File Head Message 2 Message 3 Message 4 Tail <<CAS>>

69 File Head Message 2 Message 3 Message 4 Tail <<CAS>>

70 File Head Message 3 Message 4 Tail <<CAS>>

71 File Head Message 3 Message 4 Tail <<CAS>>

72 MPSC Ring Buffer Message R Frame Length R Message Type Encoded Message

73 Takes a throughput hit on zero ing memory but latency is good

74 Aeron IPC

75 File Tail

76 File Message 1 Tail

77 File Message 1 Message 2 Tail

78 File Message 1 Message 2 Tail

79 File Message 1 Message 2 Message 3 Tail

80 File Message 1 Message 2 Message 3 Tail

81 One big file that goes on forever?

82 No!!! Page faults, page cache churn, VM pressure,...

83 Clean Dirty Message Message Active Message Message Message Message Tail Message Message

84 Do publishers need a CAS?

85 File Message 1 Message 2 Message 3 Message X Tail <<XADD>> Message Y

86 File Message 1 Message 2 Message 3 Message X Tail <<XADD>> Message Y

87 File Message 1 Message 2 Message 3 Message X Message Y Tail

88 File Message 1 Message 2 Message 3 Message X Message Y Tail

89 File Message 1 Message 2 Message 3 Message X Message Y Padding Tail

90 File Message 1 File Tail Message 2 Message 3 Message X Message Y Padding

91 File Message 1 File Message X Message 2 Tail Message 3 Message Y Padding

92 Have a background thread do zero ing and flow control

93 Burst Length = 1: RTT (ns) Prod # Mean 99% Baseline (Wait-free) ConcurrentLinkedQueue ManyToOneRingBuffer Aeron IPC

94 Burst Length = 1: RTT (ns) Prod # Mean 99% Baseline (Wait-free) ConcurrentLinkedQueue ManyToOneRingBuffer Aeron IPC

95 Aeron Data Message R Frame Length Version B E Flags Type R Term Offset Session ID Stream ID Term ID Encoded Message

96 Burst Length = 100: RTT (ns) Prod # Mean 99% Baseline (Wait-free) ConcurrentLinkedQueue 1 12,916 15, ,132 35, ,462 56,768 ManyToOneRingBuffer 1 3,773 3, ,286 13, ,255 34,432 Aeron IPC 1 4,436 4, ,623 8, ,825 13,872

97 Burst Length = 100: RTT (ns) Prod # Mean 99% Baseline (Wait-free) ConcurrentLinkedQueue 1 12,916 15, ,132 35, ,462 56,768 ManyToOneRingBuffer 1 3,773 3, ,286 13, ,255 34,432 Aeron IPC 1 4,436 4, ,623 8, ,825 13,872

98 Less False Sharing - Inlined data vs Reference Array Less Card Marking

99 Is avoiding the spinning CAS loop is a major step forward?

100 Burst Length = 100: RTT (ns) Prod # Mean 99% Baseline (Wait-free) ConcurrentLinkedQueue 1 12,916 15, ,132 35, ,462 56,768 ManyToOneRingBuffer 1 3,773 3, ,286 13, ,255 34,432 Aeron IPC 1 4,436 4, ,623 8, ,825 13,872

101 Burst Length = 100: 99% RTT (ns) 200, , , , , ,000 80,000 60,000 40,000 20,000 0

102 Logging is a Messaging Problem What design should we use?

103 5. Where can we go Next?

104 C vs Java

105 C vs Java Spin Wait Loops Thread.onSpinWait()

106 C vs Java Spin Wait Loops Thread.onSpinWait() Data Dependent Loads Heap Aggregates - ObjectLayout Stack Allocation Value Types

107 C vs Java Spin Wait Loops Thread.onSpinWait() Data Dependent Loads Heap Aggregates - ObjectLayout Stack Allocation Value Types Memory Copying Baseline against memcpy() for differing alignment

108 C vs Java Spin Wait Loops Thread.onSpinWait() Data Dependent Loads Heap Aggregates - ObjectLayout Stack Allocation Value Types Memory Copying Baseline against memcpy() for differing alignment Queue Interface Break conflated concerns to reduce blocking actions

109 In closing

110 2TB RAM and 100+ vcores!!!

111 Where can I find the code?

112 Questions? Blog: Any intelligent fool can make things bigger, more complex, and more violent. It takes a touch of genius, and a lot of courage, to move in the opposite direction. - Albert Einstein

High Performance Managed Languages. Martin Thompson

High Performance Managed Languages. Martin Thompson High Performance Managed Languages Martin Thompson - @mjpt777 Really, what s your preferred platform for building HFT applications? Why would you build low-latency applications on a GC ed platform? Some

More information

High Performance Managed Languages. Martin Thompson

High Performance Managed Languages. Martin Thompson High Performance Managed Languages Martin Thompson - @mjpt777 Really, what is your preferred platform for building HFT applications? Why do you build low-latency applications on a GC ed platform? Agenda

More information

Designing for Performance. Martin Thompson

Designing for Performance. Martin Thompson Designing for Performance Martin Thompson - @mjpt777 Feynman is becoming a real pain. He has the greatest scientific honesty of anyone I ve ever meet - William P Rogers The impact of QED cannot be overestimated.

More information

Designing for Performance. Martin Thompson

Designing for Performance. Martin Thompson Designing for Performance Martin Thompson - @mjpt777 Is it difficult writing software that has good performance? RDD (Resume Driven Development) http://www.semiconductors.org/main/2015_international_technology_roadmap_for_semiconductors_itrs/

More information

Rectangles All The Way Down. Martin Thompson

Rectangles All The Way Down. Martin Thompson Rectangles All The Way Down Martin Thompson - @mjpt777 The most amazing achievement of the computer software industry is its continuing cancellation of the steady and staggering gains made by the computer

More information

OS-caused Long JVM Pauses - Deep Dive and Solutions

OS-caused Long JVM Pauses - Deep Dive and Solutions OS-caused Long JVM Pauses - Deep Dive and Solutions Zhenyun Zhuang LinkedIn Corp., Mountain View, California, USA https://www.linkedin.com/in/zhenyun Zhenyun@gmail.com 2016-4-21 Outline q Introduction

More information

Disruptor Using High Performance, Low Latency Technology in the CERN Control System

Disruptor Using High Performance, Low Latency Technology in the CERN Control System Disruptor Using High Performance, Low Latency Technology in the CERN Control System ICALEPCS 2015 21/10/2015 2 The problem at hand 21/10/2015 WEB3O03 3 The problem at hand CESAR is used to control the

More information

PERFORMANCE ANALYSIS AND OPTIMIZATION OF SKIP LISTS FOR MODERN MULTI-CORE ARCHITECTURES

PERFORMANCE ANALYSIS AND OPTIMIZATION OF SKIP LISTS FOR MODERN MULTI-CORE ARCHITECTURES PERFORMANCE ANALYSIS AND OPTIMIZATION OF SKIP LISTS FOR MODERN MULTI-CORE ARCHITECTURES Anish Athalye and Patrick Long Mentors: Austin Clements and Stephen Tu 3 rd annual MIT PRIMES Conference Sequential

More information

RWLocks in Erlang/OTP

RWLocks in Erlang/OTP RWLocks in Erlang/OTP Rickard Green rickard@erlang.org What is an RWLock? Read/Write Lock Write locked Exclusive access for one thread Read locked Multiple reader threads No writer threads RWLocks Used

More information

Cascade Mapping: Optimizing Memory Efficiency for Flash-based Key-value Caching

Cascade Mapping: Optimizing Memory Efficiency for Flash-based Key-value Caching Cascade Mapping: Optimizing Memory Efficiency for Flash-based Key-value Caching Kefei Wang and Feng Chen Louisiana State University SoCC '18 Carlsbad, CA Key-value Systems in Internet Services Key-value

More information

Concurrent programming: From theory to practice. Concurrent Algorithms 2015 Vasileios Trigonakis

Concurrent programming: From theory to practice. Concurrent Algorithms 2015 Vasileios Trigonakis oncurrent programming: From theory to practice oncurrent Algorithms 2015 Vasileios Trigonakis From theory to practice Theoretical (design) Practical (design) Practical (implementation) 2 From theory to

More information

CS162 Operating Systems and Systems Programming Lecture 11 Page Allocation and Replacement"

CS162 Operating Systems and Systems Programming Lecture 11 Page Allocation and Replacement CS162 Operating Systems and Systems Programming Lecture 11 Page Allocation and Replacement" October 3, 2012 Ion Stoica http://inst.eecs.berkeley.edu/~cs162 Lecture 9 Followup: Inverted Page Table" With

More information

Practicing at the Cutting Edge Learning and Unlearning about Performance. Martin Thompson

Practicing at the Cutting Edge Learning and Unlearning about Performance. Martin Thompson Practicing at the Cutting Edge Learning and Unlearning about Performance Martin Thompson - @mjpt777 Learning and Unlearning 1. Brief History of Java 2. Evolving Design approach 3. An evolving Hardware

More information

Workload Characterization and Optimization of TPC-H Queries on Apache Spark

Workload Characterization and Optimization of TPC-H Queries on Apache Spark Workload Characterization and Optimization of TPC-H Queries on Apache Spark Tatsuhiro Chiba and Tamiya Onodera IBM Research - Tokyo April. 17-19, 216 IEEE ISPASS 216 @ Uppsala, Sweden Overview IBM Research

More information

Java Performance: The Definitive Guide

Java Performance: The Definitive Guide Java Performance: The Definitive Guide Scott Oaks Beijing Cambridge Farnham Kbln Sebastopol Tokyo O'REILLY Table of Contents Preface ix 1. Introduction 1 A Brief Outline 2 Platforms and Conventions 2 JVM

More information

The benefits and costs of writing a POSIX kernel in a high-level language

The benefits and costs of writing a POSIX kernel in a high-level language 1 / 38 The benefits and costs of writing a POSIX kernel in a high-level language Cody Cutler, M. Frans Kaashoek, Robert T. Morris MIT CSAIL Should we use high-level languages to build OS kernels? 2 / 38

More information

Low Latency Java in the Real World

Low Latency Java in the Real World Low Latency Java in the Real World LMAX Exchange and the Zing JVM Mark Price, Senior Developer, LMAX Exchange Gil Tene, CTO & co-founder, Azul Systems Low Latency in the Java Real World LMAX Exchange and

More information

Martin Kruliš, v

Martin Kruliš, v Martin Kruliš 1 Optimizations in General Code And Compilation Memory Considerations Parallelism Profiling And Optimization Examples 2 Premature optimization is the root of all evil. -- D. Knuth Our goal

More information

Towards Parallel, Scalable VM Services

Towards Parallel, Scalable VM Services Towards Parallel, Scalable VM Services Kathryn S McKinley The University of Texas at Austin Kathryn McKinley Towards Parallel, Scalable VM Services 1 20 th Century Simplistic Hardware View Faster Processors

More information

Java Without the Jitter

Java Without the Jitter TECHNOLOGY WHITE PAPER Achieving Ultra-Low Latency Table of Contents Executive Summary... 3 Introduction... 4 Why Java Pauses Can t Be Tuned Away.... 5 Modern Servers Have Huge Capacities Why Hasn t Latency

More information

VIRTUAL MEMORY READING: CHAPTER 9

VIRTUAL MEMORY READING: CHAPTER 9 VIRTUAL MEMORY READING: CHAPTER 9 9 MEMORY HIERARCHY Core! Processor! Core! Caching! Main! Memory! (DRAM)!! Caching!! Secondary Storage (SSD)!!!! Secondary Storage (Disk)! L cache exclusive to a single

More information

JAVA BYTE IPC: PART 1 DERIVING NIO FROM I/O. Instructor: Prasun Dewan (FB 150,

JAVA BYTE IPC: PART 1 DERIVING NIO FROM I/O. Instructor: Prasun Dewan (FB 150, JAVA BYTE IPC: PART 1 DERIVING NIO FROM I/O Instructor: Prasun Dewan (FB 150, dewan@unc.edu) ROX TUTORIAL 2 ROX ECHO APPLICATION Client Server Client Client 3 ASSIGNMENT REQUIREMENTS Client Server Client

More information

Oracle Database 12c: JMS Sharded Queues

Oracle Database 12c: JMS Sharded Queues Oracle Database 12c: JMS Sharded Queues For high performance, scalable Advanced Queuing ORACLE WHITE PAPER MARCH 2015 Table of Contents Introduction 2 Architecture 3 PERFORMANCE OF AQ-JMS QUEUES 4 PERFORMANCE

More information

KVM PERFORMANCE OPTIMIZATIONS INTERNALS. Rik van Riel Sr Software Engineer, Red Hat Inc. Thu May

KVM PERFORMANCE OPTIMIZATIONS INTERNALS. Rik van Riel Sr Software Engineer, Red Hat Inc. Thu May KVM PERFORMANCE OPTIMIZATIONS INTERNALS Rik van Riel Sr Software Engineer, Red Hat Inc. Thu May 5 2011 KVM performance optimizations What is virtualization performance? Optimizations in RHEL 6.0 Selected

More information

Page 1. Goals for Today" TLB organization" CS162 Operating Systems and Systems Programming Lecture 11. Page Allocation and Replacement"

Page 1. Goals for Today TLB organization CS162 Operating Systems and Systems Programming Lecture 11. Page Allocation and Replacement Goals for Today" CS162 Operating Systems and Systems Programming Lecture 11 Page Allocation and Replacement" Finish discussion on TLBs! Page Replacement Policies! FIFO, LRU! Clock Algorithm!! Working Set/Thrashing!

More information

Computer Systems A Programmer s Perspective 1 (Beta Draft)

Computer Systems A Programmer s Perspective 1 (Beta Draft) Computer Systems A Programmer s Perspective 1 (Beta Draft) Randal E. Bryant David R. O Hallaron August 1, 2001 1 Copyright c 2001, R. E. Bryant, D. R. O Hallaron. All rights reserved. 2 Contents Preface

More information

Kodewerk. Java Performance Services. The War on Latency. Reducing Dead Time Kirk Pepperdine Principle Kodewerk Ltd.

Kodewerk. Java Performance Services. The War on Latency. Reducing Dead Time Kirk Pepperdine Principle Kodewerk Ltd. Kodewerk tm Java Performance Services The War on Latency Reducing Dead Time Kirk Pepperdine Principle Kodewerk Ltd. Me Work as a performance tuning freelancer Nominated Sun Java Champion www.kodewerk.com

More information

Fast and Scalable Queue-Based Resource Allocation Lock on Shared-Memory Multiprocessors

Fast and Scalable Queue-Based Resource Allocation Lock on Shared-Memory Multiprocessors Fast and Scalable Queue-Based Resource Allocation Lock on Shared-Memory Multiprocessors Deli Zhang, Brendan Lynch, and Damian Dechev University of Central Florida, Orlando, USA April 27, 2016 Mutual Exclusion

More information

davidklee.net gplus.to/kleegeek linked.com/a/davidaklee

davidklee.net gplus.to/kleegeek linked.com/a/davidaklee @kleegeek davidklee.net gplus.to/kleegeek linked.com/a/davidaklee Specialties / Focus Areas / Passions: Performance Tuning & Troubleshooting Virtualization Cloud Enablement Infrastructure Architecture

More information

Multilevel Memories. Joel Emer Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology

Multilevel Memories. Joel Emer Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology 1 Multilevel Memories Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology Based on the material prepared by Krste Asanovic and Arvind CPU-Memory Bottleneck 6.823

More information

Fast and Scalable Queue-Based Resource Allocation Lock on Shared-Memory Multiprocessors

Fast and Scalable Queue-Based Resource Allocation Lock on Shared-Memory Multiprocessors Background Fast and Scalable Queue-Based Resource Allocation Lock on Shared-Memory Multiprocessors Deli Zhang, Brendan Lynch, and Damian Dechev University of Central Florida, Orlando, USA December 18,

More information

Distributed Systems Exam 1 Review Paul Krzyzanowski. Rutgers University. Fall 2016

Distributed Systems Exam 1 Review Paul Krzyzanowski. Rutgers University. Fall 2016 Distributed Systems 2015 Exam 1 Review Paul Krzyzanowski Rutgers University Fall 2016 1 Question 1 Why did the use of reference counting for remote objects prove to be impractical? Explain. It s not fault

More information

Scalable Streaming Analytics

Scalable Streaming Analytics Scalable Streaming Analytics KARTHIK RAMASAMY @karthikz TALK OUTLINE BEGIN I! II ( III b Overview Storm Overview Storm Internals IV Z V K Heron Operational Experiences END WHAT IS ANALYTICS? according

More information

JCTools: Pit Of Performance Nitsan Copyright - TTNR Labs Limited 2018

JCTools: Pit Of Performance Nitsan Copyright - TTNR Labs Limited 2018 JCTools: Pit Of Performance Nitsan Wakart/@nitsanw 1 You Look Familiar... 2 Why me? Main JCTools contributor Performance Eng. Find me on: GitHub : nitsanw Blog : psy-lob-saw.blogspot.com Twitter : nitsanw

More information

Multiprocessor Systems. Chapter 8, 8.1

Multiprocessor Systems. Chapter 8, 8.1 Multiprocessor Systems Chapter 8, 8.1 1 Learning Outcomes An understanding of the structure and limits of multiprocessor hardware. An appreciation of approaches to operating system support for multiprocessor

More information

Falcon: Scaling IO Performance in Multi-SSD Volumes. The George Washington University

Falcon: Scaling IO Performance in Multi-SSD Volumes. The George Washington University Falcon: Scaling IO Performance in Multi-SSD Volumes Pradeep Kumar H Howie Huang The George Washington University SSDs in Big Data Applications Recent trends advocate using many SSDs for higher throughput

More information

Memory Management Virtual Memory

Memory Management Virtual Memory Memory Management Virtual Memory Part of A3 course (by Theo Schouten) Biniam Gebremichael http://www.cs.ru.nl/~biniam/ Office: A6004 April 4 2005 Content Virtual memory Definition Advantage and challenges

More information

Introduction. CS3026 Operating Systems Lecture 01

Introduction. CS3026 Operating Systems Lecture 01 Introduction CS3026 Operating Systems Lecture 01 One or more CPUs Device controllers (I/O modules) Memory Bus Operating system? Computer System What is an Operating System An Operating System is a program

More information

Bottleneck Identification and Scheduling in Multithreaded Applications. José A. Joao M. Aater Suleman Onur Mutlu Yale N. Patt

Bottleneck Identification and Scheduling in Multithreaded Applications. José A. Joao M. Aater Suleman Onur Mutlu Yale N. Patt Bottleneck Identification and Scheduling in Multithreaded Applications José A. Joao M. Aater Suleman Onur Mutlu Yale N. Patt Executive Summary Problem: Performance and scalability of multithreaded applications

More information

Cluster Consensus When Aeron Met Raft. Martin Thompson

Cluster Consensus When Aeron Met Raft. Martin Thompson Cluster When Aeron Met Raft Martin Thompson - @mjpt777 What does mean? con sen sus noun \ kən-ˈsen(t)-səs \ : general agreement : unanimity Source: http://www.merriam-webster.com/ con sen sus noun \ kən-ˈsen(t)-səs

More information

Twitter Heron: Stream Processing at Scale

Twitter Heron: Stream Processing at Scale Twitter Heron: Stream Processing at Scale Saiyam Kohli December 8th, 2016 CIS 611 Research Paper Presentation -Sun Sunnie Chung TWITTER IS A REAL TIME ABSTRACT We process billions of events on Twitter

More information

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant

More information

Distributed caching for cloud computing

Distributed caching for cloud computing Distributed caching for cloud computing Maxime Lorrillere, Julien Sopena, Sébastien Monnet et Pierre Sens February 11, 2013 Maxime Lorrillere (LIP6/UPMC/CNRS) February 11, 2013 1 / 16 Introduction Context

More information

User-level Management of Kernel Memory

User-level Management of Kernel Memory User-level Management of Memory Andreas Haeberlen University of Karlsruhe Karlsruhe, Germany Kevin Elphinstone University of New South Wales Sydney, Australia 1 Motivation: memory Threads Files memory

More information

An efficient Unbounded Lock-Free Queue for Multi-Core Systems

An efficient Unbounded Lock-Free Queue for Multi-Core Systems An efficient Unbounded Lock-Free Queue for Multi-Core Systems Authors: Marco Aldinucci 1, Marco Danelutto 2, Peter Kilpatrick 3, Massimiliano Meneghin 4 and Massimo Torquati 2 1 Computer Science Dept.

More information

6.033 Computer System Engineering

6.033 Computer System Engineering MIT OpenCourseWare http://ocw.mit.edu 6.033 Computer System Engineering Spring 2009 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms. 6.033 2009 Lecture

More information

Cache-Aware Lock-Free Queues for Multiple Producers/Consumers and Weak Memory Consistency

Cache-Aware Lock-Free Queues for Multiple Producers/Consumers and Weak Memory Consistency Cache-Aware Lock-Free Queues for Multiple Producers/Consumers and Weak Memory Consistency Anders Gidenstam Håkan Sundell Philippas Tsigas School of business and informatics University of Borås Distributed

More information

IME Infinite Memory Engine Technical Overview

IME Infinite Memory Engine Technical Overview 1 1 IME Infinite Memory Engine Technical Overview 2 Bandwidth, IOPs single NVMe drive 3 What does Flash mean for Storage? It's a new fundamental device for storing bits. We must treat it different from

More information

Agenda Process Concept Process Scheduling Operations on Processes Interprocess Communication 3.2

Agenda Process Concept Process Scheduling Operations on Processes Interprocess Communication 3.2 Lecture 3: Processes Agenda Process Concept Process Scheduling Operations on Processes Interprocess Communication 3.2 Process in General 3.3 Process Concept Process is an active program in execution; process

More information

Review: Creating a Parallel Program. Programming for Performance

Review: Creating a Parallel Program. Programming for Performance Review: Creating a Parallel Program Can be done by programmer, compiler, run-time system or OS Steps for creating parallel program Decomposition Assignment of tasks to processes Orchestration Mapping (C)

More information

CS370 Operating Systems

CS370 Operating Systems CS370 Operating Systems Colorado State University Yashwant K Malaiya Fall 2017 Lecture 21 Main Memory Slides based on Text by Silberschatz, Galvin, Gagne Various sources 1 1 FAQ Why not increase page size

More information

Tales of the Tail Hardware, OS, and Application-level Sources of Tail Latency

Tales of the Tail Hardware, OS, and Application-level Sources of Tail Latency Tales of the Tail Hardware, OS, and Application-level Sources of Tail Latency Jialin Li, Naveen Kr. Sharma, Dan R. K. Ports and Steven D. Gribble February 2, 2015 1 Introduction What is Tail Latency? What

More information

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU , Spring 2013

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU , Spring 2013 Lecture 10: Cache Coherence: Part I Parallel Computer Architecture and Programming Cache design review Let s say your code executes int x = 1; (Assume for simplicity x corresponds to the address 0x12345604

More information

Solution for Operating System

Solution for Operating System Solution for Operating System May 2016 Index Q.1) a). 2 b). 3 c).3-5 d).5-7 Q.2) a). 7-13 b). 13-14 Q.3) a). 15-17 b). 18-19 Q.4) a). N.A b). N.A Q.5) a). 19-25 b). N.A Q.6) a). 26 b). 27-28 c). N.A d).

More information

A Scalable Lock Manager for Multicores

A Scalable Lock Manager for Multicores A Scalable Lock Manager for Multicores Hyungsoo Jung Hyuck Han Alan Fekete NICTA Samsung Electronics University of Sydney Gernot Heiser NICTA Heon Y. Yeom Seoul National University @University of Sydney

More information

CPS 310 second midterm exam, 11/14/2014

CPS 310 second midterm exam, 11/14/2014 CPS 310 second midterm exam, 11/14/2014 Your name please: Part 1. Sticking points Consider the Java code snippet below. Is it a legal use of Java synchronization? What happens if two threads A and B call

More information

Memory Management. To improve CPU utilization in a multiprogramming environment we need multiple programs in main memory at the same time.

Memory Management. To improve CPU utilization in a multiprogramming environment we need multiple programs in main memory at the same time. Memory Management To improve CPU utilization in a multiprogramming environment we need multiple programs in main memory at the same time. Basic CPUs and Physical Memory CPU cache Physical memory

More information

Java performance - not so scary after all

Java performance - not so scary after all Java performance - not so scary after all Holly Cummins IBM Hursley Labs 2009 IBM Corporation 2001 About me Joined IBM Began professional life writing event framework for WebSphere 2004 Moved to work on

More information

Research on the Implementation of MPI on Multicore Architectures

Research on the Implementation of MPI on Multicore Architectures Research on the Implementation of MPI on Multicore Architectures Pengqi Cheng Department of Computer Science & Technology, Tshinghua University, Beijing, China chengpq@gmail.com Yan Gu Department of Computer

More information

The Z Garbage Collector Scalable Low-Latency GC in JDK 11

The Z Garbage Collector Scalable Low-Latency GC in JDK 11 The Z Garbage Collector Scalable Low-Latency GC in JDK 11 Per Lidén (@perliden) Consulting Member of Technical Staff Java Platform Group, Oracle October 24, 2018 Safe Harbor Statement The following is

More information

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING UNIT I

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING UNIT I DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING Year and Semester : II / IV Subject Code : CS6401 Subject Name : Operating System Degree and Branch : B.E CSE UNIT I 1. Define system process 2. What is an

More information

CSL373: Lecture 5 Deadlocks (no process runnable) + Scheduling (> 1 process runnable)

CSL373: Lecture 5 Deadlocks (no process runnable) + Scheduling (> 1 process runnable) CSL373: Lecture 5 Deadlocks (no process runnable) + Scheduling (> 1 process runnable) Past & Present Have looked at two constraints: Mutual exclusion constraint between two events is a requirement that

More information

CSE 153 Design of Operating Systems

CSE 153 Design of Operating Systems CSE 153 Design of Operating Systems Winter 19 Lecture 7/8: Synchronization (1) Administrivia How is Lab going? Be prepared with questions for this weeks Lab My impression from TAs is that you are on track

More information

Pointer Analysis in the Presence of Dynamic Class Loading. Hind Presented by Brian Russell

Pointer Analysis in the Presence of Dynamic Class Loading. Hind Presented by Brian Russell Pointer Analysis in the Presence of Dynamic Class Loading Martin Hirzel, Amer Diwan and Michael Hind Presented by Brian Russell Claim: First nontrivial pointer analysis dealing with all Java language features

More information

Keeping up with the hardware

Keeping up with the hardware Keeping up with the hardware Challenges in scaling I/O performance Jonathan Davies XenServer System Performance Lead XenServer Engineering, Citrix Cambridge, UK 18 Aug 2015 Jonathan Davies (Citrix) Keeping

More information

Scaling the OpenJDK. Claes Redestad Java SE Performance Team Oracle. Copyright 2017, Oracle and/or its afliates. All rights reserved.

Scaling the OpenJDK. Claes Redestad Java SE Performance Team Oracle. Copyright 2017, Oracle and/or its afliates. All rights reserved. Scaling the OpenJDK Claes Redestad Java SE Performance Team Oracle Safe Harbor Statement The following is intended to outline our general product direction. It is intended for information purposes only,

More information

Real Time: Understanding the Trade-offs Between Determinism and Throughput

Real Time: Understanding the Trade-offs Between Determinism and Throughput Real Time: Understanding the Trade-offs Between Determinism and Throughput Roland Westrelin, Java Real-Time Engineering, Brian Doherty, Java Performance Engineering, Sun Microsystems, Inc TS-5609 Learn

More information

Modern C++ Old Dog, New Tricks Todd L.

Modern C++ Old Dog, New Tricks Todd L. StoneTor Modern C++ Old Dog, New Tricks Todd L. Montgomery @toddlmontgomery C++ is so old Languages are Tools Learning Tools is Good There are only two kinds of languages: the ones people complain about

More information

Initial Evaluation of a User-Level Device Driver Framework

Initial Evaluation of a User-Level Device Driver Framework Initial Evaluation of a User-Level Device Driver Framework Stefan Götz Karlsruhe University Germany sgoetz@ira.uka.de Kevin Elphinstone National ICT Australia University of New South Wales kevine@cse.unsw.edu.au

More information

A common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads...

A common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads... OPENMP PERFORMANCE 2 A common scenario... So I wrote my OpenMP program, and I checked it gave the right answers, so I ran some timing tests, and the speedup was, well, a bit disappointing really. Now what?.

More information

MultiJav: A Distributed Shared Memory System Based on Multiple Java Virtual Machines. MultiJav: Introduction

MultiJav: A Distributed Shared Memory System Based on Multiple Java Virtual Machines. MultiJav: Introduction : A Distributed Shared Memory System Based on Multiple Java Virtual Machines X. Chen and V.H. Allan Computer Science Department, Utah State University 1998 : Introduction Built on concurrency supported

More information

Much Faster Networking

Much Faster Networking Much Faster Networking David Riddoch driddoch@solarflare.com Copyright 2016 Solarflare Communications, Inc. All rights reserved. What is kernel bypass? The standard receive path The standard receive path

More information

3/3/2014! Anthony D. Joseph!!CS162! UCB Spring 2014!

3/3/2014! Anthony D. Joseph!!CS162! UCB Spring 2014! Post Project 1 Class Format" CS162 Operating Systems and Systems Programming Lecture 11 Page Allocation and Replacement" Mini quizzes after each topic Not graded Simple True/False Immediate feedback for

More information

Synchronization COMPSCI 386

Synchronization COMPSCI 386 Synchronization COMPSCI 386 Obvious? // push an item onto the stack while (top == SIZE) ; stack[top++] = item; // pop an item off the stack while (top == 0) ; item = stack[top--]; PRODUCER CONSUMER Suppose

More information

QUESTION BANK UNIT I

QUESTION BANK UNIT I QUESTION BANK Subject Name: Operating Systems UNIT I 1) Differentiate between tightly coupled systems and loosely coupled systems. 2) Define OS 3) What are the differences between Batch OS and Multiprogramming?

More information

wait with priority An enhanced version of the wait operation accepts an optional priority argument:

wait with priority An enhanced version of the wait operation accepts an optional priority argument: wait with priority An enhanced version of the wait operation accepts an optional priority argument: syntax: .wait the smaller the value of the parameter, the highest the priority

More information

Improve Linux User-Space Core Libraries with Restartable Sequences

Improve Linux User-Space Core Libraries with Restartable Sequences Open Source Summit 2018 Improve Linux User-Space Core Libraries with Restartable Sequences mathieu.desnoyers@efcios.com Speaker Mathieu Desnoyers CEO at EfficiOS Inc. Maintainer of: LTTng kernel and user-space

More information

5 Solutions. Solution a. no solution provided. b. no solution provided

5 Solutions. Solution a. no solution provided. b. no solution provided 5 Solutions Solution 5.1 5.1.1 5.1.2 5.1.3 5.1.4 5.1.5 5.1.6 S2 Chapter 5 Solutions Solution 5.2 5.2.1 4 5.2.2 a. I, J b. B[I][0] 5.2.3 a. A[I][J] b. A[J][I] 5.2.4 a. 3596 = 8 800/4 2 8 8/4 + 8000/4 b.

More information

Java Performance Tuning From A Garbage Collection Perspective. Nagendra Nagarajayya MDE

Java Performance Tuning From A Garbage Collection Perspective. Nagendra Nagarajayya MDE Java Performance Tuning From A Garbage Collection Perspective Nagendra Nagarajayya MDE Agenda Introduction To Garbage Collection Performance Problems Due To Garbage Collection Performance Tuning Manual

More information

Reducing CPU and network overhead for small I/O requests in network storage protocols over raw Ethernet

Reducing CPU and network overhead for small I/O requests in network storage protocols over raw Ethernet Reducing CPU and network overhead for small I/O requests in network storage protocols over raw Ethernet Pilar González-Férez and Angelos Bilas 31 th International Conference on Massive Storage Systems

More information

Concurrent Garbage Collection

Concurrent Garbage Collection Concurrent Garbage Collection Deepak Sreedhar JVM engineer, Azul Systems Java User Group Bangalore 1 @azulsystems azulsystems.com About me: Deepak Sreedhar JVM student at Azul Systems Currently working

More information

CA485 Ray Walshe Google File System

CA485 Ray Walshe Google File System Google File System Overview Google File System is scalable, distributed file system on inexpensive commodity hardware that provides: Fault Tolerance File system runs on hundreds or thousands of storage

More information

Arrakis: The Operating System is the Control Plane

Arrakis: The Operating System is the Control Plane Arrakis: The Operating System is the Control Plane Simon Peter, Jialin Li, Irene Zhang, Dan Ports, Doug Woos, Arvind Krishnamurthy, Tom Anderson University of Washington Timothy Roscoe ETH Zurich Building

More information

A common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads...

A common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads... OPENMP PERFORMANCE 2 A common scenario... So I wrote my OpenMP program, and I checked it gave the right answers, so I ran some timing tests, and the speedup was, well, a bit disappointing really. Now what?.

More information

Virtual memory. Virtual memory - Swapping. Large virtual memory. Processes

Virtual memory. Virtual memory - Swapping. Large virtual memory. Processes Virtual memory Virtual memory - Swapping Johan Montelius KTH 2017 1: Allowing two or more processes to use main memory, given them an illusion of private memory. 2: Provide the illusion of a much larger

More information

CEC 450 Real-Time Systems

CEC 450 Real-Time Systems CEC 450 Real-Time Systems Lecture 6 Accounting for I/O Latency September 28, 2015 Sam Siewert A Service Release and Response C i WCET Input/Output Latency Interference Time Response Time = Time Actuation

More information

CS 241 Honors Concurrent Data Structures

CS 241 Honors Concurrent Data Structures CS 241 Honors Concurrent Data Structures Bhuvan Venkatesh University of Illinois Urbana Champaign March 27, 2018 CS 241 Course Staff (UIUC) Lock Free Data Structures March 27, 2018 1 / 43 What to go over

More information

CPS 110 Midterm. Spring 2011

CPS 110 Midterm. Spring 2011 CPS 110 Midterm Spring 2011 Ola! Greetings from Puerto Rico, where the air is warm and salty and the mojitos are cold and sweet. Please answer all questions for a total of 200 points. Keep it clear and

More information

CIS 21 Final Study Guide. Final covers ch. 1-20, except for 17. Need to know:

CIS 21 Final Study Guide. Final covers ch. 1-20, except for 17. Need to know: CIS 21 Final Study Guide Final covers ch. 1-20, except for 17. Need to know: I. Amdahl's Law II. Moore s Law III. Processes and Threading A. What is a process? B. What is a thread? C. Modes (kernel mode,

More information

Caching and reliability

Caching and reliability Caching and reliability Block cache Vs. Latency ~10 ns 1~ ms Access unit Byte (word) Sector Capacity Gigabytes Terabytes Price Expensive Cheap Caching disk contents in RAM Hit ratio h : probability of

More information

September 15th, Finagle + Java. A love story (

September 15th, Finagle + Java. A love story ( September 15th, 2016 Finagle + Java A love story ( ) @mnnakamura hi, I m Moses Nakamura Twitter lives on the JVM When Twitter realized we couldn t stay on a Rails monolith and continue to scale at the

More information

JAVA PERFORMANCE. PR SW2 S18 Dr. Prähofer DI Leopoldseder

JAVA PERFORMANCE. PR SW2 S18 Dr. Prähofer DI Leopoldseder JAVA PERFORMANCE PR SW2 S18 Dr. Prähofer DI Leopoldseder OUTLINE 1. What is performance? 1. Benchmarking 2. What is Java performance? 1. Interpreter vs JIT 3. Tools to measure performance 4. Memory Performance

More information

CS510 Operating System Foundations. Jonathan Walpole

CS510 Operating System Foundations. Jonathan Walpole CS510 Operating System Foundations Jonathan Walpole File System Performance File System Performance Memory mapped files - Avoid system call overhead Buffer cache - Avoid disk I/O overhead Careful data

More information

Multiprocessors II: CC-NUMA DSM. CC-NUMA for Large Systems

Multiprocessors II: CC-NUMA DSM. CC-NUMA for Large Systems Multiprocessors II: CC-NUMA DSM DSM cache coherence the hardware stuff Today s topics: what happens when we lose snooping new issues: global vs. local cache line state enter the directory issues of increasing

More information

Java Concurrency. Towards a better life By - -

Java Concurrency. Towards a better life By - - Java Concurrency Towards a better life By - Srinivasan.raghavan@oracle.com - Vaibhav.x.choudhary@oracle.com Java Releases J2SE 6: - Collection Framework enhancement -Drag and Drop -Improve IO support J2SE

More information

Concurrent Data Structures Concurrent Algorithms 2016

Concurrent Data Structures Concurrent Algorithms 2016 Concurrent Data Structures Concurrent Algorithms 2016 Tudor David (based on slides by Vasileios Trigonakis) Tudor David 11.2016 1 Data Structures (DSs) Constructs for efficiently storing and retrieving

More information

Thread-Local. Lecture 27: Concurrency 3. Dealing with the Rest. Immutable. Whenever possible, don t share resources

Thread-Local. Lecture 27: Concurrency 3. Dealing with the Rest. Immutable. Whenever possible, don t share resources Thread-Local Lecture 27: Concurrency 3 CS 62 Fall 2016 Kim Bruce & Peter Mawhorter Some slides based on those from Dan Grossman, U. of Washington Whenever possible, don t share resources Easier to have

More information

CSE 486/586 Distributed Systems

CSE 486/586 Distributed Systems CSE 486/586 Distributed Systems Mutual Exclusion Steve Ko Computer Sciences and Engineering University at Buffalo CSE 486/586 Recap: Consensus On a synchronous system There s an algorithm that works. On

More information

IME (Infinite Memory Engine) Extreme Application Acceleration & Highly Efficient I/O Provisioning

IME (Infinite Memory Engine) Extreme Application Acceleration & Highly Efficient I/O Provisioning IME (Infinite Memory Engine) Extreme Application Acceleration & Highly Efficient I/O Provisioning September 22 nd 2015 Tommaso Cecchi 2 What is IME? This breakthrough, software defined storage application

More information

COP 4610: Introduction to Operating Systems (Spring 2016) Chapter 3: Process. Zhi Wang Florida State University

COP 4610: Introduction to Operating Systems (Spring 2016) Chapter 3: Process. Zhi Wang Florida State University COP 4610: Introduction to Operating Systems (Spring 2016) Chapter 3: Process Zhi Wang Florida State University Contents Process concept Process scheduling Operations on processes Inter-process communication

More information