Designing for Performance. Martin Thompson

Size: px
Start display at page:

Download "Designing for Performance. Martin Thompson"

Transcription

1 Designing for Performance Martin Thompson

2

3

4 Feynman is becoming a real pain. He has the greatest scientific honesty of anyone I ve ever meet - William P Rogers

5 The impact of QED cannot be overestimated. It explains everything that is not explained by gravity. It s also the most accurate theory ever tested by experiments on Earth. - Freeman Dyson

6

7 It does not matter how intelligent you are, if you guess and that guess cannot be backed up by experimental evidence then it is still a guess. - Richard Feynman

8 How do we Design for Performance?

9 1. What is Performance? 2. What is Clean & Representative? 3. Implementing efficient Models 4. How to Performance Test

10 Performance

11 Throughput (aka Bandwidth)

12 Response Time (aka Latency)

13 Scalability

14

15 Response Time Queuing Theory Utilisation

16 Pro Tip: Ensure you have sufficient capacity

17 Can we go parallel to speedup?

18 Amdahl s Law time Sequential Process A B

19 Amdahl s Law time Sequential Process A B Parallel Process A A A A A B

20 Amdahl s Law time Sequential Process A B Parallel Process A A A A A B Parallel Process B A B B B B

21 Amdahl's Law

22 Universal Scalability Law C(N) = N / (1 + α(n 1) + ((β* N) * (N 1))) C = capacity or throughput N = number of processors α = contention penalty β = coherence penalty

23 Speedup Universal Scalability Law Processors Amdahl USL

24 Clean & Representative

25 - Clean Morally uncontaminated; pure; innocent - Oxford English Dictionary

26 - Representative Serving as a portrayal or symbol of something - Oxford English Dictionary

27 - Representative Code is the best place to capture current understanding of a model

28 Abstractions

29 Rules of Abstraction 1. Don t use abstraction 2. Don t use abstraction 3. Only consider abstracting when you see at least 3 things that ARE the same 4. Abstractions must pay for themselves 5. Beware DRY, the evil siren that tricks you into abstraction

30 Abstraction Megamorphism => Branch Hell

31 Abstraction Not Representative => Big Smell

32 Abstraction Say no to big frameworks!

33

34 Pro Tip: Abstract when you are sure of the benefits

35 Law of Leaky Abstractions All non-trivial abstractions, to some extent, are leaky. - Joel Spolsky

36 Law of Leaky Abstractions The detail of underlying complexity cannot be ignored.

37 the purpose of abstracting is not to be vague, but to create a new semantic level in which one can be absolutely precise - Dijkstra

38 How can we abstract memory systems?

39 - It s about 3 bets! 1. The Temporal Bet

40 - It s about 3 bets! 1. The Temporal Bet 2. The Spatial Bet

41 - It s about 3 bets! 1. The Temporal Bet 2. The Spatial Bet 3. The Striding Bet

42 Model Implementation

43 Coupling vs Cohesion

44 Coupling vs Cohesion public class Queue { private final Object[] buffer; private final int capacity; } // Rest of the code

45 Coupling vs Cohesion public class Queue { private final Object[] buffer; private final int capacity; } // Rest of the code

46 Coupling vs Cohesion Properties Bag

47 Pro Tip: Respect Locality of Reference

48 Relationships

49 Relationships Order Book Order

50 Relationships Order Book 1 Bids * 1 Offers * Order

51 Relationships Order Book Price Price 1 Bids * 1 Offers * Order

52 Relationships Order Book Price Price 1 Bids * {ordered, FIFO} 1 Offers * {ordered, FIFO} Order

53 Pro Tip: Make friends with your Data Structures

54 Pro Tip: Document, discuss, design tests, before going to code

55 Algorithms

56 Order of Algorithms

57 Order of Algorithms Magnitude of n?

58 Pro Tip: Know the cardinality of all significant relationships

59 Pro Tip: Algorithms are your key to service time

60 Batching

61 Amortise the expensive costs

62 Natural Batching Producers

63 Natural Batching << Amortise Expensive Costs >> Batcher Producers

64 Response Time Natural Batching Typical Possible Load

65 Pro Tip: Batch processing is not just for offline

66 Branches, branches, branches...

67 Branches public void dostuff(list<string> things) { if (null == things things.isempty()) { return; } } for (String thing : things) { // Do useful work }

68 Branches public void dostuff(list<string> things) { if (null == things things.isempty()) { return; } } for (String thing : things) { // Do useful work }

69 Branches public void dostuff(list<string> things) { for (String thing : things) { // Do useful work } }

70 Pro Tip: Respect the Principle of least surprise

71 Loops

72 Loops If I had more time, I would have written a shorter letter. - Blaise Pascal

73 Loops L µops

74 Loops L µops LBB 28 µops LBB 28 µops

75 Pro Tip: Craft major loops like good prose

76 Composition

77 Composition Size matters

78 Composition Inlining is THE optimisation. - Cliff Click

79 Composition Single Responsibility

80 Pro Tip: Small atoms can compose to build anything

81 Data

82 Data

83 Data

84 Data

85 Data

86 Data

87 Pro Tip: Embrace Set Theory and FP techniques

88 Performance Testing

89 Define Performance Goals

90 Establish Design Principles

91 Aeron Design Principles 1. Garbage free in steady state running 2. Smart Batching in the message path 3. Lock-free algos in the message path 4. Non-blocking IO in the message path 5. No exceptional cases in message path 6. Apply the Single Writer Principle 7. Prefer unshared state 8. Avoid unnecessary data copies

92 How to measure response time?

93 Response Time Histograms

94 Response Time Histograms Mode

95 Response Time Histograms Mode Median

96 Response Time Histograms Mode Median Mean

97 Coordinated Omission Source: Gil Tene (Azul Systems)

98 HdrHistogram

99 Java Microbenchmark Harness

100 CPU Performance Counters

101 Performance test as part of Continuous Integration

102 Can your acceptance tests run as performance tests?

103 Build telemetry into production systems

104 AGAIN!!! Build telemetry into production systems

105 Counters of: Queue Lengths Concurrent Users Exceptions Transactions - orders, trades Etc.

106 Histograms of: Response Times Service Times Queue Lengths Concurrent Users Etc.

107 In closing

108 Clean => Uncontaminated

109 Representative => True Portrayal

110 Does it pass the Out Loud test?

111 Measure Don t Guess!!!

112

113 Questions? It does not matter how intelligent you are, if you guess and that guess cannot be backed up by experimental evidence then it is still a guess. - Richard Feynman

Designing for Performance. Martin Thompson

Designing for Performance. Martin Thompson Designing for Performance Martin Thompson - @mjpt777 Is it difficult writing software that has good performance? RDD (Resume Driven Development) http://www.semiconductors.org/main/2015_international_technology_roadmap_for_semiconductors_itrs/

More information

Rectangles All The Way Down. Martin Thompson

Rectangles All The Way Down. Martin Thompson Rectangles All The Way Down Martin Thompson - @mjpt777 The most amazing achievement of the computer software industry is its continuing cancellation of the steady and staggering gains made by the computer

More information

A Quest for Predictable Latency Adventures in Java Concurrency. Martin Thompson

A Quest for Predictable Latency Adventures in Java Concurrency. Martin Thompson A Quest for Predictable Latency Adventures in Java Concurrency Martin Thompson - @mjpt777 If a system does not respond in a timely manner then it is effectively unavailable 1. It s all about the Blocking

More information

High Performance Managed Languages. Martin Thompson

High Performance Managed Languages. Martin Thompson High Performance Managed Languages Martin Thompson - @mjpt777 Really, what s your preferred platform for building HFT applications? Why would you build low-latency applications on a GC ed platform? Some

More information

High Performance Managed Languages. Martin Thompson

High Performance Managed Languages. Martin Thompson High Performance Managed Languages Martin Thompson - @mjpt777 Really, what is your preferred platform for building HFT applications? Why do you build low-latency applications on a GC ed platform? Agenda

More information

Practicing at the Cutting Edge Learning and Unlearning about Performance. Martin Thompson

Practicing at the Cutting Edge Learning and Unlearning about Performance. Martin Thompson Practicing at the Cutting Edge Learning and Unlearning about Performance Martin Thompson - @mjpt777 Learning and Unlearning 1. Brief History of Java 2. Evolving Design approach 3. An evolving Hardware

More information

Understanding Hardware Transactional Memory

Understanding Hardware Transactional Memory Understanding Hardware Transactional Memory Gil Tene, CTO & co-founder, Azul Systems @giltene 2015 Azul Systems, Inc. Agenda Brief introduction What is Hardware Transactional Memory (HTM)? Cache coherence

More information

CSE Traditional Operating Systems deal with typical system software designed to be:

CSE Traditional Operating Systems deal with typical system software designed to be: CSE 6431 Traditional Operating Systems deal with typical system software designed to be: general purpose running on single processor machines Advanced Operating Systems are designed for either a special

More information

Cluster Consensus When Aeron Met Raft. Martin Thompson

Cluster Consensus When Aeron Met Raft. Martin Thompson Cluster When Aeron Met Raft Martin Thompson - @mjpt777 What does mean? con sen sus noun \ kən-ˈsen(t)-səs \ : general agreement : unanimity Source: http://www.merriam-webster.com/ con sen sus noun \ kən-ˈsen(t)-səs

More information

Computer Architecture. R. Poss

Computer Architecture. R. Poss Computer Architecture R. Poss 1 ca01-10 september 2015 Course & organization 2 ca01-10 september 2015 Aims of this course The aims of this course are: to highlight current trends to introduce the notion

More information

Concept of a process

Concept of a process Concept of a process In the context of this course a process is a program whose execution is in progress States of a process: running, ready, blocked Submit Ready Running Completion Blocked Concurrent

More information

Why memory hierarchy? Memory hierarchy. Memory hierarchy goals. CS2410: Computer Architecture. L1 cache design. Sangyeun Cho

Why memory hierarchy? Memory hierarchy. Memory hierarchy goals. CS2410: Computer Architecture. L1 cache design. Sangyeun Cho Why memory hierarchy? L1 cache design Sangyeun Cho Computer Science Department Memory hierarchy Memory hierarchy goals Smaller Faster More expensive per byte CPU Regs L1 cache L2 cache SRAM SRAM To provide

More information

Modern Processor Architectures. L25: Modern Compiler Design

Modern Processor Architectures. L25: Modern Compiler Design Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions

More information

CS 136: Advanced Architecture. Review of Caches

CS 136: Advanced Architecture. Review of Caches 1 / 30 CS 136: Advanced Architecture Review of Caches 2 / 30 Why Caches? Introduction Basic goal: Size of cheapest memory... At speed of most expensive Locality makes it work Temporal locality: If you

More information

Martin Kruliš, v

Martin Kruliš, v Martin Kruliš 1 Optimizations in General Code And Compilation Memory Considerations Parallelism Profiling And Optimization Examples 2 Premature optimization is the root of all evil. -- D. Knuth Our goal

More information

CS 450 Exam 2 Mon. 11/7/2016

CS 450 Exam 2 Mon. 11/7/2016 CS 450 Exam 2 Mon. 11/7/2016 Name: Rules and Hints You may use one handwritten 8.5 11 cheat sheet (front and back). This is the only additional resource you may consult during this exam. No calculators.

More information

Disruptor Using High Performance, Low Latency Technology in the CERN Control System

Disruptor Using High Performance, Low Latency Technology in the CERN Control System Disruptor Using High Performance, Low Latency Technology in the CERN Control System ICALEPCS 2015 21/10/2015 2 The problem at hand 21/10/2015 WEB3O03 3 The problem at hand CESAR is used to control the

More information

Software Analysis. Asymptotic Performance Analysis

Software Analysis. Asymptotic Performance Analysis Software Analysis Performance Analysis Presenter: Jonathan Aldrich Performance Analysis Software Analysis 1 Asymptotic Performance Analysis How do we compare algorithm performance? Abstract away low-level

More information

Engineering Robust Server Software

Engineering Robust Server Software Engineering Robust Server Software Scalability Intro To Scalability What does scalability mean? 2 Intro To Scalability 100 Design 1 Design 2 Latency (usec) 75 50 25 0 1 2 4 8 16 What does scalability mean?

More information

Design of Concurrent and Distributed Data Structures

Design of Concurrent and Distributed Data Structures METIS Spring School, Agadir, Morocco, May 2015 Design of Concurrent and Distributed Data Structures Christoph Kirsch University of Salzburg Joint work with M. Dodds, A. Haas, T.A. Henzinger, A. Holzer,

More information

CMSC Computer Architecture Lecture 15: Memory Consistency and Synchronization. Prof. Yanjing Li University of Chicago

CMSC Computer Architecture Lecture 15: Memory Consistency and Synchronization. Prof. Yanjing Li University of Chicago CMSC 22200 Computer Architecture Lecture 15: Memory Consistency and Synchronization Prof. Yanjing Li University of Chicago Administrative Stuff! Lab 5 (multi-core) " Basic requirements: out later today

More information

COMP3151/9151 Foundations of Concurrency Lecture 8

COMP3151/9151 Foundations of Concurrency Lecture 8 1 COMP3151/9151 Foundations of Concurrency Lecture 8 Liam O Connor CSE, UNSW (and data61) 8 Sept 2017 2 Shared Data Consider the Readers and Writers problem from Lecture 6: Problem We have a large data

More information

The Java Memory Model

The Java Memory Model Jeremy Manson 1, William Pugh 1, and Sarita Adve 2 1 University of Maryland 2 University of Illinois at Urbana-Champaign Presented by John Fisher-Ogden November 22, 2005 Outline Introduction Sequential

More information

Concurrency and High Performance Reloaded

Concurrency and High Performance Reloaded Concurrency and High Performance Reloaded Disclaimer Any performance tuning advice provided in this presentation... will be wrong! 2 www.kodewerk.com Me Work as independent (a.k.a. freelancer) performance

More information

Caches Concepts Review

Caches Concepts Review Caches Concepts Review What is a block address? Why not bring just what is needed by the processor? What is a set associative cache? Write-through? Write-back? Then we ll see: Block allocation policy on

More information

Introduction to Concurrent Software Systems. CSCI 5828: Foundations of Software Engineering Lecture 08 09/17/2015

Introduction to Concurrent Software Systems. CSCI 5828: Foundations of Software Engineering Lecture 08 09/17/2015 Introduction to Concurrent Software Systems CSCI 5828: Foundations of Software Engineering Lecture 08 09/17/2015 1 Goals Present an overview of concurrency in software systems Review the benefits and challenges

More information

High Performance Computing

High Performance Computing The Need for Parallelism High Performance Computing David McCaughan, HPC Analyst SHARCNET, University of Guelph dbm@sharcnet.ca Scientific investigation traditionally takes two forms theoretical empirical

More information

Reactive design patterns for microservices on multicore

Reactive design patterns for microservices on multicore Reactive Software with elegance Reactive design patterns for microservices on multicore Reactive summit - 22/10/18 charly.bechara@tredzone.com Outline Microservices on multicore Reactive Multicore Patterns

More information

Agile Architecture. The Why, the What and the How

Agile Architecture. The Why, the What and the How Agile Architecture The Why, the What and the How Copyright Net Objectives, Inc. All Rights Reserved 2 Product Portfolio Management Product Management Lean for Executives SAFe for Executives Scaled Agile

More information

Introduction to Concurrent Software Systems. CSCI 5828: Foundations of Software Engineering Lecture 12 09/29/2016

Introduction to Concurrent Software Systems. CSCI 5828: Foundations of Software Engineering Lecture 12 09/29/2016 Introduction to Concurrent Software Systems CSCI 5828: Foundations of Software Engineering Lecture 12 09/29/2016 1 Goals Present an overview of concurrency in software systems Review the benefits and challenges

More information

Performance analysis basics

Performance analysis basics Performance analysis basics Christian Iwainsky Iwainsky@rz.rwth-aachen.de 25.3.2010 1 Overview 1. Motivation 2. Performance analysis basics 3. Measurement Techniques 2 Why bother with performance analysis

More information

Concurrent Computing

Concurrent Computing Concurrent Computing Introduction SE205, P1, 2017 Administrivia Language: (fr)anglais? Lectures: Fridays (15.09-03.11), 13:30-16:45, Amphi Grenat Web page: https://se205.wp.imt.fr/ Exam: 03.11, 15:15-16:45

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

A common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads...

A common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads... OPENMP PERFORMANCE 2 A common scenario... So I wrote my OpenMP program, and I checked it gave the right answers, so I ran some timing tests, and the speedup was, well, a bit disappointing really. Now what?.

More information

NAMD Serial and Parallel Performance

NAMD Serial and Parallel Performance NAMD Serial and Parallel Performance Jim Phillips Theoretical Biophysics Group Serial performance basics Main factors affecting serial performance: Molecular system size and composition. Cutoff distance

More information

CS 571 Operating Systems. Midterm Review. Angelos Stavrou, George Mason University

CS 571 Operating Systems. Midterm Review. Angelos Stavrou, George Mason University CS 571 Operating Systems Midterm Review Angelos Stavrou, George Mason University Class Midterm: Grading 2 Grading Midterm: 25% Theory Part 60% (1h 30m) Programming Part 40% (1h) Theory Part (Closed Books):

More information

Efficient Java (with Stratosphere) Arvid Heise, Large Scale Duplicate Detection

Efficient Java (with Stratosphere) Arvid Heise, Large Scale Duplicate Detection Efficient Java (with Stratosphere) Arvid Heise, Large Scale Duplicate Detection Agenda 2 Bottlenecks Mutable vs. Immutable Caching/Pooling Strings Primitives Final Classloaders Exception Handling Concurrency

More information

Instruction-Level Parallelism Dynamic Branch Prediction. Reducing Branch Penalties

Instruction-Level Parallelism Dynamic Branch Prediction. Reducing Branch Penalties Instruction-Level Parallelism Dynamic Branch Prediction CS448 1 Reducing Branch Penalties Last chapter static schemes Move branch calculation earlier in pipeline Static branch prediction Always taken,

More information

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant

More information

Principles of Software Construction: Objects, Design, and Concurrency. The Perils of Concurrency Can't live with it. Cant live without it.

Principles of Software Construction: Objects, Design, and Concurrency. The Perils of Concurrency Can't live with it. Cant live without it. Principles of Software Construction: Objects, Design, and Concurrency The Perils of Concurrency Can't live with it. Cant live without it. Spring 2014 Charlie Garrod Christian Kästner School of Computer

More information

Setting the stage... Key Design Issues. Main purpose - Manage software system complexity by improving software quality factors

Setting the stage... Key Design Issues. Main purpose - Manage software system complexity by improving software quality factors Setting the stage... Dr. Radu Marinescu 1 1946 Key Design Issues Main purpose - Manage software system complexity by...... improving software quality factors... facilitating systematic reuse Dr. Radu Marinescu

More information

Ext4 Filesystem Scaling

Ext4 Filesystem Scaling Ext4 Filesystem Scaling Jan Kára SUSE Labs Overview Handling of orphan inodes in ext4 Shrinking cache of logical to physical block mappings Cleanup of transaction checkpoint lists 2 Orphan

More information

CSCI-GA Multicore Processors: Architecture & Programming Lecture 3: The Memory System You Can t Ignore it!

CSCI-GA Multicore Processors: Architecture & Programming Lecture 3: The Memory System You Can t Ignore it! CSCI-GA.3033-012 Multicore Processors: Architecture & Programming Lecture 3: The Memory System You Can t Ignore it! Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Memory Computer Technology

More information

ECE C61 Computer Architecture Lecture 2 performance. Prof. Alok N. Choudhary.

ECE C61 Computer Architecture Lecture 2 performance. Prof. Alok N. Choudhary. ECE C61 Computer Architecture Lecture 2 performance Prof Alok N Choudhary choudhar@ecenorthwesternedu 2-1 Today s s Lecture Performance Concepts Response Time Throughput Performance Evaluation Benchmarks

More information

CS61C : Machine Structures

CS61C : Machine Structures inst.eecs.berkeley.edu/~cs61c CS61C : Machine Structures Lecture 35 Caches IV / VM I 2004-11-19 Andy Carle inst.eecs.berkeley.edu/~cs61c-ta Google strikes back against recent encroachments into the Search

More information

CS 162 Operating Systems and Systems Programming Professor: Anthony D. Joseph Spring Lecture 8: Semaphores, Monitors, & Condition Variables

CS 162 Operating Systems and Systems Programming Professor: Anthony D. Joseph Spring Lecture 8: Semaphores, Monitors, & Condition Variables CS 162 Operating Systems and Systems Programming Professor: Anthony D. Joseph Spring 2004 Lecture 8: Semaphores, Monitors, & Condition Variables 8.0 Main Points: Definition of semaphores Example of use

More information

Priming Java for Speed

Priming Java for Speed Priming Java for Speed Getting Fast & Staying Fast Gil Tene, CTO & co-founder, Azul Systems 2013 Azul Systems, Inc. High level agenda Intro Java realities at Load Start A whole bunch of compiler optimization

More information

CS 162 Operating Systems and Systems Programming Professor: Anthony D. Joseph Spring 2004

CS 162 Operating Systems and Systems Programming Professor: Anthony D. Joseph Spring 2004 CS 162 Operating Systems and Systems Programming Professor: Anthony D. Joseph Spring 2004 Lecture 9: Readers-Writers and Language Support for Synchronization 9.1.2 Constraints 1. Readers can access database

More information

Lecture 12. Lecture 12: Access Methods

Lecture 12. Lecture 12: Access Methods Lecture 12 Lecture 12: Access Methods Lecture 12 If you don t find it in the index, look very carefully through the entire catalog - Sears, Roebuck and Co., Consumers Guide, 1897 2 Lecture 12 > Section

More information

Faster Objects and Arrays. Closing the [last?] inherent C vs. Java speed gap

Faster Objects and Arrays. Closing the [last?] inherent C vs. Java speed gap Faster Objects and Arrays Closing the [last?] inherent C vs. Java speed gap http://www.objectlayout.org Gil Tene, CTO & co-founder, Azul Systems About me: Gil Tene co-founder, CTO @Azul Systems Have been

More information

Real Time: Understanding the Trade-offs Between Determinism and Throughput

Real Time: Understanding the Trade-offs Between Determinism and Throughput Real Time: Understanding the Trade-offs Between Determinism and Throughput Roland Westrelin, Java Real-Time Engineering, Brian Doherty, Java Performance Engineering, Sun Microsystems, Inc TS-5609 Learn

More information

PROCESSES AND THREADS

PROCESSES AND THREADS PROCESSES AND THREADS A process is a heavyweight flow that can execute concurrently with other processes. A thread is a lightweight flow that can execute concurrently with other threads within the same

More information

Continuous Performance Testing. Mark Price Performance Engineer Improbable.io

Continuous Performance Testing. Mark Price Performance Engineer Improbable.io Continuous Performance Testing Mark Price / @epickrram Performance Engineer Improbable.io The ideal System performance testing as a first-class citizen of the continuous delivery pipeline Process Process

More information

LOW LATENCY DATA DISTRIBUTION IN CAPITAL MARKETS: GETTING IT RIGHT

LOW LATENCY DATA DISTRIBUTION IN CAPITAL MARKETS: GETTING IT RIGHT LOW LATENCY DATA DISTRIBUTION IN CAPITAL MARKETS: GETTING IT RIGHT PATRICK KUSTER Head of Business Development, Enterprise Capabilities, Thomson Reuters +358 (40) 840 7788; patrick.kuster@thomsonreuters.com

More information

CPU Architecture Overview. Varun Sampath CIS 565 Spring 2012

CPU Architecture Overview. Varun Sampath CIS 565 Spring 2012 CPU Architecture Overview Varun Sampath CIS 565 Spring 2012 Objectives Performance tricks of a modern CPU Pipelining Branch Prediction Superscalar Out-of-Order (OoO) Execution Memory Hierarchy Vector Operations

More information

Topics. Digital Systems Architecture EECE EECE Need More Cache?

Topics. Digital Systems Architecture EECE EECE Need More Cache? Digital Systems Architecture EECE 33-0 EECE 9-0 Need More Cache? Dr. William H. Robinson March, 00 http://eecs.vanderbilt.edu/courses/eece33/ Topics Cache: a safe place for hiding or storing things. Webster

More information

Practical Objects: Test Driven Software Development using JUnit

Practical Objects: Test Driven Software Development using JUnit 1999 McBreen.Consulting Practical Objects Test Driven Software Development using JUnit Pete McBreen, McBreen.Consulting petemcbreen@acm.org Test Driven Software Development??? The Unified Process is Use

More information

Other consistency models

Other consistency models Last time: Symmetric multiprocessing (SMP) Lecture 25: Synchronization primitives Computer Architecture and Systems Programming (252-0061-00) CPU 0 CPU 1 CPU 2 CPU 3 Timothy Roscoe Herbstsemester 2012

More information

Coherence and Consistency

Coherence and Consistency Coherence and Consistency 30 The Meaning of Programs An ISA is a programming language To be useful, programs written in it must have meaning or semantics Any sequence of instructions must have a meaning.

More information

Algorithm Performance Factors. Memory Performance of Algorithms. Processor-Memory Performance Gap. Moore s Law. Program Model of Memory I

Algorithm Performance Factors. Memory Performance of Algorithms. Processor-Memory Performance Gap. Moore s Law. Program Model of Memory I Memory Performance of Algorithms CSE 32 Data Structures Lecture Algorithm Performance Factors Algorithm choices (asymptotic running time) O(n 2 ) or O(n log n) Data structure choices List or Arrays Language

More information

Atomicity via Source-to-Source Translation

Atomicity via Source-to-Source Translation Atomicity via Source-to-Source Translation Benjamin Hindman Dan Grossman University of Washington 22 October 2006 Atomic An easier-to-use and harder-to-implement primitive void deposit(int x){ synchronized(this){

More information

Concurrent programming: From theory to practice. Concurrent Algorithms 2015 Vasileios Trigonakis

Concurrent programming: From theory to practice. Concurrent Algorithms 2015 Vasileios Trigonakis oncurrent programming: From theory to practice oncurrent Algorithms 2015 Vasileios Trigonakis From theory to practice Theoretical (design) Practical (design) Practical (implementation) 2 From theory to

More information

Parallel Algorithm Engineering

Parallel Algorithm Engineering Parallel Algorithm Engineering Kenneth S. Bøgh PhD Fellow Based on slides by Darius Sidlauskas Outline Background Current multicore architectures UMA vs NUMA The openmp framework and numa control Examples

More information

Cycle Time for Non-pipelined & Pipelined processors

Cycle Time for Non-pipelined & Pipelined processors Cycle Time for Non-pipelined & Pipelined processors Fetch Decode Execute Memory Writeback 250ps 350ps 150ps 300ps 200ps For a non-pipelined processor, the clock cycle is the sum of the latencies of all

More information

CPU Pipelining Issues

CPU Pipelining Issues CPU Pipelining Issues What have you been beating your head against? This pipe stuff makes my head hurt! L17 Pipeline Issues & Memory 1 Pipelining Improve performance by increasing instruction throughput

More information

Operating System. Chapter 4. Threads. Lynn Choi School of Electrical Engineering

Operating System. Chapter 4. Threads. Lynn Choi School of Electrical Engineering Operating System Chapter 4. Threads Lynn Choi School of Electrical Engineering Process Characteristics Resource ownership Includes a virtual address space (process image) Ownership of resources including

More information

Lecture: Cache Hierarchies. Topics: cache innovations (Sections B.1-B.3, 2.1)

Lecture: Cache Hierarchies. Topics: cache innovations (Sections B.1-B.3, 2.1) Lecture: Cache Hierarchies Topics: cache innovations (Sections B.1-B.3, 2.1) 1 Types of Cache Misses Compulsory misses: happens the first time a memory word is accessed the misses for an infinite cache

More information

Final Lecture. A few minutes to wrap up and add some perspective

Final Lecture. A few minutes to wrap up and add some perspective Final Lecture A few minutes to wrap up and add some perspective 1 2 Instant replay The quarter was split into roughly three parts and a coda. The 1st part covered instruction set architectures the connection

More information

Stanford University Computer Science Department CS 140 Midterm Exam Dawson Engler Winter 1999

Stanford University Computer Science Department CS 140 Midterm Exam Dawson Engler Winter 1999 Stanford University Computer Science Department CS 140 Midterm Exam Dawson Engler Winter 1999 Name: Please initial the bottom left corner of each page. This is an open-book exam. You have 50 minutes to

More information

Performance. CS 3410 Computer System Organization & Programming. [K. Bala, A. Bracy, E. Sirer, and H. Weatherspoon]

Performance. CS 3410 Computer System Organization & Programming. [K. Bala, A. Bracy, E. Sirer, and H. Weatherspoon] Performance CS 3410 Computer System Organization & Programming [K. Bala, A. Bracy, E. Sirer, and H. Weatherspoon] Performance Complex question How fast is the processor? How fast your application runs?

More information

Parallel Processing: Performance Limits & Synchronization

Parallel Processing: Performance Limits & Synchronization Parallel Processing: Performance Limits & Synchronization Daniel Sanchez Computer Science & Artificial Intelligence Lab M.I.T. May 8, 2018 L22-1 Administrivia Quiz 3: Thursday 7:30-9:30pm, room 50-340

More information

Agenda. Highlight issues with multi threaded programming Introduce thread synchronization primitives Introduce thread safe collections

Agenda. Highlight issues with multi threaded programming Introduce thread synchronization primitives Introduce thread safe collections Thread Safety Agenda Highlight issues with multi threaded programming Introduce thread synchronization primitives Introduce thread safe collections 2 2 Need for Synchronization Creating threads is easy

More information

Memory Hierarchy. Slides contents from:

Memory Hierarchy. Slides contents from: Memory Hierarchy Slides contents from: Hennessy & Patterson, 5ed Appendix B and Chapter 2 David Wentzlaff, ELE 475 Computer Architecture MJT, High Performance Computing, NPTEL Memory Performance Gap Memory

More information

EITF20: Computer Architecture Part4.1.1: Cache - 2

EITF20: Computer Architecture Part4.1.1: Cache - 2 EITF20: Computer Architecture Part4.1.1: Cache - 2 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Cache performance optimization Bandwidth increase Reduce hit time Reduce miss penalty Reduce miss

More information

Scalability, Performance & Caching

Scalability, Performance & Caching COMP 150-IDS: Internet Scale Distributed Systems (Spring 2015) Scalability, Performance & Caching Noah Mendelsohn Tufts University Email: noah@cs.tufts.edu Web: http://www.cs.tufts.edu/~noah Copyright

More information

Caching Basics. Memory Hierarchies

Caching Basics. Memory Hierarchies Caching Basics CS448 1 Memory Hierarchies Takes advantage of locality of reference principle Most programs do not access all code and data uniformly, but repeat for certain data choices spatial nearby

More information

Executive Summary. It is important for a Java Programmer to understand the power and limitations of concurrent programming in Java using threads.

Executive Summary. It is important for a Java Programmer to understand the power and limitations of concurrent programming in Java using threads. Executive Summary. It is important for a Java Programmer to understand the power and limitations of concurrent programming in Java using threads. Poor co-ordination that exists in threads on JVM is bottleneck

More information

All Paging Schemes Depend on Locality. VM Page Replacement. Paging. Demand Paging

All Paging Schemes Depend on Locality. VM Page Replacement. Paging. Demand Paging 3/14/2001 1 All Paging Schemes Depend on Locality VM Page Replacement Emin Gun Sirer Processes tend to reference pages in localized patterns Temporal locality» locations referenced recently likely to be

More information

CS 147: Computer Systems Performance Analysis

CS 147: Computer Systems Performance Analysis CS 147: Computer Systems Performance Analysis Test Loads CS 147: Computer Systems Performance Analysis Test Loads 1 / 33 Overview Overview Overview 2 / 33 Test Load Design Test Load Design Test Load Design

More information

A common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads...

A common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads... OPENMP PERFORMANCE 2 A common scenario... So I wrote my OpenMP program, and I checked it gave the right answers, so I ran some timing tests, and the speedup was, well, a bit disappointing really. Now what?.

More information

Reducing Seek Overhead with Application Directed Prefetching

Reducing Seek Overhead with Application Directed Prefetching Reducing Seek Overhead with Application Directed Prefetching Steve VanDeBogart, Christopher Frost, Eddie Kohler University of California, Los Angeles http://libprefetch.cs.ucla.edu Disks are Relatively

More information

ECE Spring 2017 Exam 2

ECE Spring 2017 Exam 2 ECE 56300 Spring 2017 Exam 2 All questions are worth 5 points. For isoefficiency questions, do not worry about breaking costs down to t c, t w and t s. Question 1. Innovative Big Machines has developed

More information

HOW TO WRITE PARALLEL PROGRAMS AND UTILIZE CLUSTERS EFFICIENTLY

HOW TO WRITE PARALLEL PROGRAMS AND UTILIZE CLUSTERS EFFICIENTLY HOW TO WRITE PARALLEL PROGRAMS AND UTILIZE CLUSTERS EFFICIENTLY Sarvani Chadalapaka HPC Administrator University of California Merced, Office of Information Technology schadalapaka@ucmerced.edu it.ucmerced.edu

More information

Programming Paradigms for Concurrency Lecture 3 Concurrent Objects

Programming Paradigms for Concurrency Lecture 3 Concurrent Objects Programming Paradigms for Concurrency Lecture 3 Concurrent Objects Based on companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Modified by Thomas Wies New York University

More information

Control Hazards. Branch Prediction

Control Hazards. Branch Prediction Control Hazards The nub of the problem: In what pipeline stage does the processor fetch the next instruction? If that instruction is a conditional branch, when does the processor know whether the conditional

More information

Review: Creating a Parallel Program. Programming for Performance

Review: Creating a Parallel Program. Programming for Performance Review: Creating a Parallel Program Can be done by programmer, compiler, run-time system or OS Steps for creating parallel program Decomposition Assignment of tasks to processes Orchestration Mapping (C)

More information

Java Without the Jitter

Java Without the Jitter TECHNOLOGY WHITE PAPER Achieving Ultra-Low Latency Table of Contents Executive Summary... 3 Introduction... 4 Why Java Pauses Can t Be Tuned Away.... 5 Modern Servers Have Huge Capacities Why Hasn t Latency

More information

Shared Memory and Distributed Multiprocessing. Bhanu Kapoor, Ph.D. The Saylor Foundation

Shared Memory and Distributed Multiprocessing. Bhanu Kapoor, Ph.D. The Saylor Foundation Shared Memory and Distributed Multiprocessing Bhanu Kapoor, Ph.D. The Saylor Foundation 1 Issue with Parallelism Parallel software is the problem Need to get significant performance improvement Otherwise,

More information

Maximizing Face Detection Performance

Maximizing Face Detection Performance Maximizing Face Detection Performance Paulius Micikevicius Developer Technology Engineer, NVIDIA GTC 2015 1 Outline Very brief review of cascaded-classifiers Parallelization choices Reducing the amount

More information

Lecture: Large Caches, Virtual Memory. Topics: cache innovations (Sections 2.4, B.4, B.5)

Lecture: Large Caches, Virtual Memory. Topics: cache innovations (Sections 2.4, B.4, B.5) Lecture: Large Caches, Virtual Memory Topics: cache innovations (Sections 2.4, B.4, B.5) 1 Techniques to Reduce Cache Misses Victim caches Better replacement policies pseudo-lru, NRU Prefetching, cache

More information

Data Structures - CSCI 102. CS102 Hash Tables. Prof. Tejada. Copyright Sheila Tejada

Data Structures - CSCI 102. CS102 Hash Tables. Prof. Tejada. Copyright Sheila Tejada CS102 Hash Tables Prof. Tejada 1 Vectors, Linked Lists, Stack, Queues, Deques Can t provide fast insertion/removal and fast lookup at the same time The Limitations of Data Structure Binary Search Trees,

More information

High Performance Data Structures: Theory and Practice. Roman Elizarov Devexperts, 2006

High Performance Data Structures: Theory and Practice. Roman Elizarov Devexperts, 2006 High Performance Data Structures: Theory and Practice Roman Elizarov Devexperts, 2006 e1 High performance? NO!!! We ll talk about different High Performance. One that aimed at avoiding all those racks

More information

Dynamic Control Hazard Avoidance

Dynamic Control Hazard Avoidance Dynamic Control Hazard Avoidance Consider Effects of Increasing the ILP Control dependencies rapidly become the limiting factor they tend to not get optimized by the compiler more instructions/sec ==>

More information

CHAPTER 6: PROCESS SYNCHRONIZATION

CHAPTER 6: PROCESS SYNCHRONIZATION CHAPTER 6: PROCESS SYNCHRONIZATION The slides do not contain all the information and cannot be treated as a study material for Operating System. Please refer the text book for exams. TOPICS Background

More information

LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY

LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY Abridged version of Patterson & Hennessy (2013):Ch.5 Principle of Locality Programs access a small proportion of their address space at any time Temporal

More information

EITF20: Computer Architecture Part4.1.1: Cache - 2

EITF20: Computer Architecture Part4.1.1: Cache - 2 EITF20: Computer Architecture Part4.1.1: Cache - 2 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Cache performance optimization Bandwidth increase Reduce hit time Reduce miss penalty Reduce miss

More information

Lecture 29 Review" CPU time: the best metric" Be sure you understand CC, clock period" Common (and good) performance metrics"

Lecture 29 Review CPU time: the best metric Be sure you understand CC, clock period Common (and good) performance metrics Be sure you understand CC, clock period Lecture 29 Review Suggested reading: Everything Q1: D[8] = D[8] + RF[1] + RF[4] I[15]: Add R2, R1, R4 RF[1] = 4 I[16]: MOV R3, 8 RF[4] = 5 I[17]: Add R2, R2, R3

More information

Scalability, Performance & Caching

Scalability, Performance & Caching COMP 150-IDS: Internet Scale Distributed Systems (Spring 2018) Scalability, Performance & Caching Noah Mendelsohn Tufts University Email: noah@cs.tufts.edu Web: http://www.cs.tufts.edu/~noah Copyright

More information

A Sophomoric Introduction to Shared-Memory Parallelism and Concurrency Lecture 5 Programming with Locks and Critical Sections

A Sophomoric Introduction to Shared-Memory Parallelism and Concurrency Lecture 5 Programming with Locks and Critical Sections A Sophomoric Introduction to Shared-Memory Parallelism and Concurrency Lecture 5 Programming with Locks and Critical Sections Dan Grossman Last Updated: May 2012 For more information, see http://www.cs.washington.edu/homes/djg/teachingmaterials/

More information

Locks Chapter 4. Overview. Introduction Spin Locks. Queue locks. Roger Wattenhofer. Test-and-Set & Test-and-Test-and-Set Backoff lock

Locks Chapter 4. Overview. Introduction Spin Locks. Queue locks. Roger Wattenhofer. Test-and-Set & Test-and-Test-and-Set Backoff lock Locks Chapter 4 Roger Wattenhofer ETH Zurich Distributed Computing www.disco.ethz.ch Overview Introduction Spin Locks Test-and-Set & Test-and-Test-and-Set Backoff lock Queue locks 4/2 1 Introduction: From

More information