Disruptor Using High Performance, Low Latency Technology in the CERN Control System
|
|
- Cody Rice
- 5 years ago
- Views:
Transcription
1
2 Disruptor Using High Performance, Low Latency Technology in the CERN Control System ICALEPCS /10/2015 2
3 The problem at hand 21/10/2015 WEB3O03 3
4 The problem at hand CESAR is used to control the devices in CERN experimental areas Motor Detector Power Converter CESAR Server 21/10/2015 WEB3O03 4
5 The problem at hand CESAR is used to control the devices in CERN experimental areas These devices produce 2500 event streams Motor Detector Power Converter x 2500 CESAR Server 21/10/2015 WEB3O03 5
6 The problem at hand CESAR is used to control the devices in CERN experimental areas These devices produce 2500 event streams Motor Detector Power Converter x 2500 The business logic on the CESAR server combines the data coming from these streams to calculate device states State Device A CESAR Server State Device B State Device C 21/10/2015 WEB3O03 6
7 The problem at hand CESAR is used to control the devices in CERN experimental areas These devices produce 2500 event streams Motor Detector Power Converter x 2500 The business logic on the CESAR server combines the data coming from these streams to calculate device states This concurrent processing must be properly synchronized State Device A CESAR Server State Device B State Device C 21/10/2015 WEB3O03 7
8 What happens when all flows converge? 21/10/2015 WEB3O03 8
9 21/10/2015 WEB3O03 9
10 21/10/2015 WEB3O03 10
11 21/10/2015 WEB3O03 11
12 1- Some background about the Disruptor 21/10/2015 WEB3O03 12
13 1- Some background about the Disruptor Created by LMAX, a trading company, to build a high performance Forex exchange 21/10/2015 WEB3O03 13
14 1- Some background about the Disruptor Created by LMAX, a trading company, to build a high performance Forex exchange Is the result of different trials and errors 21/10/2015 WEB3O03 14
15 1- Some background about the Disruptor Created by LMAX, a trading company, to build a high performance Forex exchange Is the result of different trials and errors Challenges the idea that CPUs are not getting any faster 21/10/2015 WEB3O03 15
16 1- Some background about the Disruptor Created by LMAX, a trading company, to build a high performance Forex exchange Is the result of different trials and errors Challenges the idea that CPUs are not getting any faster Designed to take advantage of the architecture of modern CPUs, following the concept of mechanical sympathy 21/10/2015 WEB3O03 16
17 Feed the cores avoid cache misses Core 1 Core 2 L1 Cache L1 Cache 64 KB L2 Cache L2 Cache 256 KB Socket L3 Cache Socket Interconnect 1 to 20 MB RAM 21/10/2015 WEB3O03 17
18 Feed the cores avoid cache misses Core 1 Core 2 L1 Cache L1 Cache 64 KB 1 ns L2 Cache L2 Cache 256 KB Socket L3 Cache Socket Interconnect 1 to 20 MB RAM 21/10/2015 WEB3O03 18
19 Feed the cores avoid cache misses Core 1 Core 2 L1 Cache L1 Cache 64 KB 1 ns L2 Cache L2 Cache 256 KB 3 ns Socket L3 Cache Socket Interconnect 1 to 20 MB RAM 21/10/2015 WEB3O03 19
20 Feed the cores avoid cache misses Core 1 Core 2 L1 Cache L1 Cache 64 KB 1 ns L2 Cache L2 Cache 256 KB 3 ns Socket L3 Cache 1 to 20 MB 12 ns Socket Interconnect RAM 21/10/2015 WEB3O03 20
21 Feed the cores avoid cache misses Core 1 Core 2 L1 Cache L1 Cache 64 KB 1 ns L2 Cache L2 Cache 256 KB 3 ns Socket L3 Cache 1 to 20 MB 12 ns Socket Interconnect RAM 65 ns 21/10/2015 WEB3O03 21
22 L1 / L2 Avoiding false sharing Core 1 Core 2 X Y X Y 1 to 3 ns 1 cache line = 64 bytes (on modern x86) L3 12 ns Socket Interconnect 40 ns 21/10/2015 WEB3O03 22
23 L1 / L2 Avoiding false sharing Core 1 Core 2 X Y X Y 1 to 3 ns 1 cache line = 64 bytes (on modern x86) L3 12 ns Socket Interconnect 40 ns 21/10/2015 WEB3O03 23
24 L1 / L2 Avoiding false sharing Core 1 Core 2 X Y X Y 1 to 3 ns 1 cache line = 64 bytes (on modern x86) L3 12 ns Socket Interconnect 40 ns 21/10/2015 WEB3O03 24
25 L1 / L2 Avoiding false sharing Core 1 Core 2 X Y X Y 1 to 3 ns 1 cache line = 64 bytes (on modern x86) L3 X Y 12 ns Socket Interconnect 40 ns 21/10/2015 WEB3O03 25
26 L1 / L2 Avoiding false sharing Core 1 Core 2 X Y X Y 1 to 3 ns 1 cache line = 64 bytes (on modern x86) L3 X Y 12 ns Socket Interconnect 40 ns 21/10/2015 WEB3O03 26
27 L1 / L2 Avoiding false sharing Core 1 Core 2 X Y X Y 1 to 3 ns 1 cache line = 64 bytes (on modern x86) L3 X Y 12 ns Socket Interconnect 40 ns 21/10/2015 WEB3O03 27
28 L1 / L2 Avoiding false sharing Core 1 Core 2 X Y X Y 1 to 3 ns 1 cache line = 64 bytes (on modern x86) L3 X Y 12 ns Socket Interconnect 40 ns 21/10/2015 WEB3O03 28
29 Avoiding false sharing Core 1 Core 2 1 cache line = 64 bytes (on modern x86) L1 / L2 X Y 1 to 3 ns L3 12 ns Socket Interconnect 40 ns The solution? 21/10/2015 WEB3O03 29
30 Avoiding false sharing Core 1 Core 2 1 cache line = 64 bytes (on modern x86) L1 / L2 P P X P P P P P Y P P P 1 to 3 ns L3 12 ns Socket Interconnect 40 ns The solution? Padding 21/10/2015 WEB3O03 30
31 2 - Disruptor architecture What is it? Can be viewed as a very efficient FIFO bounded queue A data structure to pass data between threads, designed to avoid contention 21/10/2015 WEB3O03 31
32 The mighty ring buffer 13 Producer Consumer 4 21/10/2015 WEB3O03 32
33 The mighty ring buffer Producer Consumer 45 21/10/2015 WEB3O03 33
34 The mighty ring buffer Producer Consumer 45 Represented internally as an array caches gets prefetched 21/10/2015 WEB3O03 34
35 The mighty ring buffer Producer Consumer 45 Represented internally as an array caches gets prefetched The sequence number is a padded long no false sharing 21/10/2015 WEB3O03 35
36 The mighty ring buffer Producer Consumer 45 Represented internally as an array caches gets prefetched The sequence number is a padded long no false sharing The memory visibility relies on the volatile sequence number no locks 21/10/2015 WEB3O03 36
37 The mighty ring buffer Producer Consumer 45 Represented internally as an array caches gets prefetched The sequence number is a padded long no false sharing The memory visibility relies on the volatile sequence number no locks Slots are preallocated no garbage collection 21/10/2015 WEB3O03 37
38 Main differences compared to a queue 21/10/2015 WEB3O03 38
39 Main differences compared to a queue Latency and jitter reduced to a minimum 21/10/2015 WEB3O03 39
40 Main differences compared to a queue Latency and jitter reduced to a minimum No garbage collection 21/10/2015 WEB3O03 40
41 Main differences compared to a queue Latency and jitter reduced to a minimum No garbage collection Can have multiple consumers organized in a graph of dependency 21/10/2015 WEB3O03 41
42 Main differences compared to a queue Latency and jitter reduced to a minimum No garbage collection Can have multiple consumers organized in a graph of dependency Consumers can use batching to catch up with producers 21/10/2015 WEB3O03 42
43 Benefits for the architecture 21/10/2015 WEB3O03 43
44 Benefits for the architecture Performance No locks, no garbage collection, CPU friendly 21/10/2015 WEB3O03 44
45 Benefits for the architecture Performance No locks, no garbage collection, CPU friendly Determinism The order in which events were processed is known Messages can be replayed to rebuild the server state 21/10/2015 WEB3O03 45
46 Benefits for the architecture Performance No locks, no garbage collection, CPU friendly Determinism The order in which events were processed is known Messages can be replayed to rebuild the server state Simplification of the code base Since the business logic runs on a single thread, there is no need to worry about concurrency 21/10/2015 WEB3O03 46
47 3 The Disruptor in CESAR 21/10/2015 WEB3O03 47
48 3 The Disruptor in CESAR Each event received from hardware is stored on the ring buffer Motor Detector Power Converter 21/10/2015 WEB3O03 48
49 3 The Disruptor in CESAR Each event received from hardware is stored on the ring buffer For each stream of data, the last value is kept Motor Detector Power Converter 21/10/2015 WEB3O03 49
50 3 The Disruptor in CESAR Each event received from hardware is stored on the ring buffer For each stream of data, the last value is kept We make use of batching Motor Detector Power Converter 21/10/2015 WEB3O03 50
51 3 The Disruptor in CESAR Each event received from hardware is stored on the ring buffer For each stream of data, the last value is kept We make use of batching At the end of a batch, the business logic is triggered and executed on a single thread Motor Detector Power Converter State A State B 21/10/2015 WEB3O03 51
52 3 The Disruptor in CESAR Each event received from hardware is stored on the ring buffer For each stream of data, the last value is kept We make use of batching At the end of a batch, the business logic is triggered and executed on a single thread Publish the new states over the network, making sure that we do not block the Disruptor thread if the message broker is down Motor Detector Power Converter State A State B 21/10/2015 WEB3O03 52
53 Conclusions The Disruptor, a tool from the world of finance, fits really well in an Accelerator control system It simplified the CERN CESAR code base while handling the flow of data more efficiently It is easily integrated in an existing design to replace a queue or a full pipeline of queues The main challenge faced was to switch the developers mind-set to think in asynchronous terms 21/10/2015 WEB3O03 53
54 Useful Links The Disruptor main page with an introduction and code samples: Presentation of the Disruptor at Qcon An article from Martin Fowler: A useful presentation on Latency by Gil Tene who shows that most of what we measure during performance test is wrong: New Async logger in Log4J /10/2015 WEB3O03 54
55
Reactive design patterns for microservices on multicore
Reactive Software with elegance Reactive design patterns for microservices on multicore Reactive summit - 22/10/18 charly.bechara@tredzone.com Outline Microservices on multicore Reactive Multicore Patterns
More informationSystems I: Programming Abstractions
Systems I: Programming Abstractions Course Philosophy: The goal of this course is to help students become facile with foundational concepts in programming, including experience with algorithmic problem
More informationTuesday, April 6, Inside SQL Server
Inside SQL Server Thank you Goals What happens when a query runs? What each component does How to observe what s going on Delicious SQL Cake Delicious SQL Cake Delicious SQL Cake Delicious SQL Cake Delicious
More informationA Quest for Predictable Latency Adventures in Java Concurrency. Martin Thompson
A Quest for Predictable Latency Adventures in Java Concurrency Martin Thompson - @mjpt777 If a system does not respond in a timely manner then it is effectively unavailable 1. It s all about the Blocking
More informationLearning with Purpose
Network Measurement for 100Gbps Links Using Multicore Processors Xiaoban Wu, Dr. Peilong Li, Dr. Yongyi Ran, Prof. Yan Luo Department of Electrical and Computer Engineering University of Massachusetts
More informationCS 856 Latency in Communication Systems
CS 856 Latency in Communication Systems Winter 2010 Latency Challenges CS 856, Winter 2010, Latency Challenges 1 Overview Sources of Latency low-level mechanisms services Application Requirements Latency
More informationHigh performance and scalable architectures
High performance and scalable architectures A practical introduction to CQRS and Axon Framework Allard Buijze allard.buijze@trifork.nl Allard Buijze Software Architect at Trifork Organizers of GOTO & QCON
More informationAn efficient Unbounded Lock-Free Queue for Multi-Core Systems
An efficient Unbounded Lock-Free Queue for Multi-Core Systems Authors: Marco Aldinucci 1, Marco Danelutto 2, Peter Kilpatrick 3, Massimiliano Meneghin 4 and Massimo Torquati 2 1 Computer Science Dept.
More informationGPU Fundamentals Jeff Larkin November 14, 2016
GPU Fundamentals Jeff Larkin , November 4, 206 Who Am I? 2002 B.S. Computer Science Furman University 2005 M.S. Computer Science UT Knoxville 2002 Graduate Teaching Assistant 2005 Graduate
More informationPredictable Programming on a Precision Timed Architecture
Predictable Programming on a Precision Timed Architecture Ben Lickly, Isaac Liu, Hiren Patel, Edward Lee, University of California, Berkeley Sungjun Kim, Stephen Edwards, Columbia University, New York
More informationHigh Performance Transactions in Deuteronomy
High Performance Transactions in Deuteronomy Justin Levandoski, David Lomet, Sudipta Sengupta, Ryan Stutsman, and Rui Wang Microsoft Research Overview Deuteronomy: componentized DB stack Separates transaction,
More informationAN 831: Intel FPGA SDK for OpenCL
AN 831: Intel FPGA SDK for OpenCL Host Pipelined Multithread Subscribe Send Feedback Latest document on the web: PDF HTML Contents Contents 1 Intel FPGA SDK for OpenCL Host Pipelined Multithread...3 1.1
More informationThe Google File System
The Google File System Sanjay Ghemawat, Howard Gobioff and Shun Tak Leung Google* Shivesh Kumar Sharma fl4164@wayne.edu Fall 2015 004395771 Overview Google file system is a scalable distributed file system
More informationPersistent Memory. High Speed and Low Latency. White Paper M-WP006
Persistent Memory High Speed and Low Latency White Paper M-WP6 Corporate Headquarters: 3987 Eureka Dr., Newark, CA 9456, USA Tel: (51) 623-1231 Fax: (51) 623-1434 E-mail: info@smartm.com Customer Service:
More informationDesign Patterns for Real-Time Computer Music Systems
Design Patterns for Real-Time Computer Music Systems Roger B. Dannenberg and Ross Bencina 4 September 2005 This document contains a set of design patterns for real time systems, particularly for computer
More informationSTEPS Towards Cache-Resident Transaction Processing
STEPS Towards Cache-Resident Transaction Processing Stavros Harizopoulos joint work with Anastassia Ailamaki VLDB 2004 Carnegie ellon CPI OLTP workloads on modern CPUs 6 4 2 L2-I stalls L2-D stalls L1-I
More informationEnhanced Ethernet Switching Technology. Time Applications. Rui Santos 17 / 04 / 2009
Enhanced Ethernet Switching Technology for Adaptive Hard Real- Time Applications Rui Santos (rsantos@ua.pt) 17 / 04 / 2009 Problem 2 Switched Ethernet became common in real-time communications Some interesting
More informationMuch Faster Networking
Much Faster Networking David Riddoch driddoch@solarflare.com Copyright 2016 Solarflare Communications, Inc. All rights reserved. What is kernel bypass? The standard receive path The standard receive path
More informationTDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading
Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5
More informationG Robert Grimm New York University
G22.3250-001 Receiver Livelock Robert Grimm New York University Altogether Now: The Three Questions What is the problem? What is new or different? What are the contributions and limitations? Motivation
More informationDatabase Architecture 2 & Storage. Instructor: Matei Zaharia cs245.stanford.edu
Database Architecture 2 & Storage Instructor: Matei Zaharia cs245.stanford.edu Summary from Last Time System R mostly matched the architecture of a modern RDBMS» SQL» Many storage & access methods» Cost-based
More informationCSE 141 Spring 2016 Homework 5 PID: Name: 1. Consider the following matrix transpose code int i, j,k; double *A, *B, *C; A = (double
CSE 141 Spring 2016 Homework 5 PID: Name: 1. Consider the following matrix transpose code int i, j,k; double *A, *B, *C; A = (double *)malloc(sizeof(double)*n*n); B = (double *)malloc(sizeof(double)*n*n);
More informationLecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU , Spring 2013
Lecture 10: Cache Coherence: Part I Parallel Computer Architecture and Programming Cache design review Let s say your code executes int x = 1; (Assume for simplicity x corresponds to the address 0x12345604
More informationRecovering Disk Storage Metrics from low level Trace events
Recovering Disk Storage Metrics from low level Trace events Progress Report Meeting May 05, 2016 Houssem Daoud Michel Dagenais École Polytechnique de Montréal Laboratoire DORSAL Agenda Introduction and
More informationIntroduction to Concurrent Software Systems. CSCI 5828: Foundations of Software Engineering Lecture 08 09/17/2015
Introduction to Concurrent Software Systems CSCI 5828: Foundations of Software Engineering Lecture 08 09/17/2015 1 Goals Present an overview of concurrency in software systems Review the benefits and challenges
More informationMarcus Gründler aixigo AG. Financial Portfolio Management on Steroids
Marcus Gründler aixigo AG Financial Portfolio Management on Steroids Marcus Gründler @marcusgruendler About me Head of Portfolio Management Systems Architect at aixigo AG, Germany www.aixigo.de JAX Finance
More informationTechniques for Mitigating Memory Latency Effects in the PA-8500 Processor. David Johnson Systems Technology Division Hewlett-Packard Company
Techniques for Mitigating Memory Latency Effects in the PA-8500 Processor David Johnson Systems Technology Division Hewlett-Packard Company Presentation Overview PA-8500 Overview uction Fetch Capabilities
More informationBasics DRAM ORGANIZATION. Storage element (capacitor) Data In/Out Buffers. Word Line. Bit Line. Switching element HIGH-SPEED MEMORY SYSTEMS
Basics DRAM ORGANIZATION DRAM Word Line Bit Line Storage element (capacitor) In/Out Buffers Decoder Sense Amps... Bit Lines... Switching element Decoder... Word Lines... Memory Array Page 1 Basics BUS
More informationROB IN Performance Measurements
ROB IN Performance Measurements I. Mandjavidze CEA Saclay, 91191 Gif-sur-Yvette CEDEX, France ROB Complex Hardware Organisation Mode of Operation ROB Complex Software Organisation Performance Measurements
More informationIntroduction to Concurrent Software Systems. CSCI 5828: Foundations of Software Engineering Lecture 12 09/29/2016
Introduction to Concurrent Software Systems CSCI 5828: Foundations of Software Engineering Lecture 12 09/29/2016 1 Goals Present an overview of concurrency in software systems Review the benefits and challenges
More informationDynamic Fine Grain Scheduling of Pipeline Parallelism. Presented by: Ram Manohar Oruganti and Michael TeWinkle
Dynamic Fine Grain Scheduling of Pipeline Parallelism Presented by: Ram Manohar Oruganti and Michael TeWinkle Overview Introduction Motivation Scheduling Approaches GRAMPS scheduling method Evaluation
More informationAccelerated Programmable Services. FPGA and GPU augmented infrastructure.
Accelerated Programmable Services FPGA and GPU augmented infrastructure graeme.burnett@hatstand.com Here and Now Market data 10GbE feeds common moving to 40GbE then 100GbE Software feed handlers can barely
More informationAdvanced Parallel Programming I
Advanced Parallel Programming I Alexander Leutgeb, RISC Software GmbH RISC Software GmbH Johannes Kepler University Linz 2016 22.09.2016 1 Levels of Parallelism RISC Software GmbH Johannes Kepler University
More informationLecture: DRAM Main Memory. Topics: virtual memory wrap-up, DRAM intro and basics (Section 2.3)
Lecture: DRAM Main Memory Topics: virtual memory wrap-up, DRAM intro and basics (Section 2.3) 1 TLB and Cache 2 Virtually Indexed Caches 24-bit virtual address, 4KB page size 12 bits offset and 12 bits
More informationTechnology for Adaptive Hard. Rui Santos, UA
HaRTES Meeting Enhanced Ethernet Switching Technology for Adaptive Hard Real-Time Applications Rui Santos, rsantos@ua.pt, UA SUMMARY 2 MOTIVATION Switched Ethernet t became common in real-time communications
More informationModern CPU Architectures
Modern CPU Architectures Alexander Leutgeb, RISC Software GmbH RISC Software GmbH Johannes Kepler University Linz 2014 16.04.2014 1 Motivation for Parallelism I CPU History RISC Software GmbH Johannes
More informationL7: Performance. Frans Kaashoek Spring 2013
L7: Performance Frans Kaashoek kaashoek@mit.edu 6.033 Spring 2013 Overview Technology fixes some performance problems Ride the technology curves if you can Some performance requirements require thinking
More informationSpeculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution
Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution Ravi Rajwar and Jim Goodman University of Wisconsin-Madison International Symposium on Microarchitecture, Dec. 2001 Funding
More informationCS250 VLSI Systems Design Lecture 9: Patterns for Processing Units and Communication Links
CS250 VLSI Systems Design Lecture 9: Patterns for Processing Units and Communication Links John Wawrzynek, Krste Asanovic, with John Lazzaro and Yunsup Lee (TA) UC Berkeley Fall 2010 Unit-Transaction Level
More information! Design constraints. " Component failures are the norm. " Files are huge by traditional standards. ! POSIX-like
Cloud background Google File System! Warehouse scale systems " 10K-100K nodes " 50MW (1 MW = 1,000 houses) " Power efficient! Located near cheap power! Passive cooling! Power Usage Effectiveness = Total
More informationIntroduction to Computing and Systems Architecture
Introduction to Computing and Systems Architecture 1. Computability A task is computable if a sequence of instructions can be described which, when followed, will complete such a task. This says little
More informationCapriccio : Scalable Threads for Internet Services
Capriccio : Scalable Threads for Internet Services - Ron von Behren &et al - University of California, Berkeley. Presented By: Rajesh Subbiah Background Each incoming request is dispatched to a separate
More informationLow Latency Data Grids in Finance
Low Latency Data Grids in Finance Jags Ramnarayan Chief Architect GemStone Systems jags.ramnarayan@gemstone.com Copyright 2006, GemStone Systems Inc. All Rights Reserved. Background on GemStone Systems
More informationI/O Buffering and Streaming
I/O Buffering and Streaming I/O Buffering and Caching I/O accesses are reads or writes (e.g., to files) Application access is arbitary (offset, len) Convert accesses to read/write of fixed-size blocks
More informationSpring 2017 :: CSE 506. Device Programming. Nima Honarmand
Device Programming Nima Honarmand read/write interrupt read/write Spring 2017 :: CSE 506 Device Interface (Logical View) Device Interface Components: Device registers Device Memory DMA buffers Interrupt
More informationWhy memory hierarchy? Memory hierarchy. Memory hierarchy goals. CS2410: Computer Architecture. L1 cache design. Sangyeun Cho
Why memory hierarchy? L1 cache design Sangyeun Cho Computer Science Department Memory hierarchy Memory hierarchy goals Smaller Faster More expensive per byte CPU Regs L1 cache L2 cache SRAM SRAM To provide
More informationGrappa: A latency tolerant runtime for large-scale irregular applications
Grappa: A latency tolerant runtime for large-scale irregular applications Jacob Nelson, Brandon Holt, Brandon Myers, Preston Briggs, Luis Ceze, Simon Kahan, Mark Oskin Computer Science & Engineering, University
More informationParallel Computing. Prof. Marco Bertini
Parallel Computing Prof. Marco Bertini Modern CPUs Historical trends in CPU performance From Data processing in exascale class computer systems, C. Moore http://www.lanl.gov/orgs/hpc/salishan/salishan2011/3moore.pdf
More informationReceive Livelock. Robert Grimm New York University
Receive Livelock Robert Grimm New York University The Three Questions What is the problem? What is new or different? What are the contributions and limitations? Motivation Interrupts work well when I/O
More informationHow In-memory Database Technology and IBM soliddb Complement IBM DB2 for Extreme Speed
soliddb How In-memory Database Technology and IBM soliddb Complement IBM DB2 for Extreme Speed December 2008 Joan Monera-Llorca IBM Software Group, Information Management soliddb Technical Sales - Americas
More informationCMU /618 Practice Exercise 1
CMU 15-418/618 Practice Exercise 1 A Task Queue on a Multi-Core, Multi-Threaded CPU The figure below shows a simple single-core CPU with an and execution contexts for up to two threads of control. Core
More informationChapter 13: I/O Systems
Chapter 13: I/O Systems Chapter 13: I/O Systems I/O Hardware Application I/O Interface Kernel I/O Subsystem Transforming I/O Requests to Hardware Operations Streams Performance 13.2 Silberschatz, Galvin
More informationPortland State University ECE 588/688. Cray-1 and Cray T3E
Portland State University ECE 588/688 Cray-1 and Cray T3E Copyright by Alaa Alameldeen 2014 Cray-1 A successful Vector processor from the 1970s Vector instructions are examples of SIMD Contains vector
More informationOptimizing DirectX Graphics. Richard Huddy European Developer Relations Manager
Optimizing DirectX Graphics Richard Huddy European Developer Relations Manager Some early observations Bear in mind that graphics performance problems are both commoner and rarer than you d think The most
More informationAccelerating Microsoft SQL Server Performance With NVDIMM-N on Dell EMC PowerEdge R740
Accelerating Microsoft SQL Server Performance With NVDIMM-N on Dell EMC PowerEdge R740 A performance study with NVDIMM-N Dell EMC Engineering September 2017 A Dell EMC document category Revisions Date
More informationInput/Output Systems
Input/Output Systems CSCI 315 Operating Systems Design Department of Computer Science Notice: The slides for this lecture have been largely based on those from an earlier edition of the course text Operating
More informationFlash: an efficient and portable web server
Flash: an efficient and portable web server High Level Ideas Server performance has several dimensions Lots of different choices on how to express and effect concurrency in a program Paper argues that
More informationArchitectural Differences nc. DRAM devices are accessed with a multiplexed address scheme. Each unit of data is accessed by first selecting its row ad
nc. Application Note AN1801 Rev. 0.2, 11/2003 Performance Differences between MPC8240 and the Tsi106 Host Bridge Top Changwatchai Roy Jenevein risc10@email.sps.mot.com CPD Applications This paper discusses
More informationOverview: Shared Memory Hardware. Shared Address Space Systems. Shared Address Space and Shared Memory Computers. Shared Memory Hardware
Overview: Shared Memory Hardware Shared Address Space Systems overview of shared address space systems example: cache hierarchy of the Intel Core i7 cache coherency protocols: basic ideas, invalidate and
More informationOverview: Shared Memory Hardware
Overview: Shared Memory Hardware overview of shared address space systems example: cache hierarchy of the Intel Core i7 cache coherency protocols: basic ideas, invalidate and update protocols false sharing
More informationCOS 318: Operating Systems. NSF, Snapshot, Dedup and Review
COS 318: Operating Systems NSF, Snapshot, Dedup and Review Topics! NFS! Case Study: NetApp File System! Deduplication storage system! Course review 2 Network File System! Sun introduced NFS v2 in early
More informationDevice-Functionality Progression
Chapter 12: I/O Systems I/O Hardware I/O Hardware Application I/O Interface Kernel I/O Subsystem Transforming I/O Requests to Hardware Operations Incredible variety of I/O devices Common concepts Port
More informationChapter 12: I/O Systems. I/O Hardware
Chapter 12: I/O Systems I/O Hardware Application I/O Interface Kernel I/O Subsystem Transforming I/O Requests to Hardware Operations I/O Hardware Incredible variety of I/O devices Common concepts Port
More informationsecondary storage: Secondary Stg Any modern computer system will incorporate (at least) two levels of storage:
Secondary Storage 1 Any modern computer system will incorporate (at least) two levels of storage: primary storage: random access memory (RAM) typical capacity 256MB to 4GB cost per MB $0.10 typical access
More informationCUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni
CUDA Optimizations WS 2014-15 Intelligent Robotics Seminar 1 Table of content 1 Background information 2 Optimizations 3 Summary 2 Table of content 1 Background information 2 Optimizations 3 Summary 3
More informationCMU Introduction to Computer Architecture, Spring 2014 HW 5: Virtual Memory, Caching and Main Memory
CMU 18-447 Introduction to Computer Architecture, Spring 2014 HW 5: Virtual Memory, Caching and Main Memory Instructor: Prof. Onur Mutlu TAs: Rachata Ausavarungnirun, Varun Kohli, Xiao Bo Zhao, Paraj Tyle
More informationVirtual to physical address translation
Virtual to physical address translation Virtual memory with paging Page table per process Page table entry includes present bit frame number modify bit flags for protection and sharing. Page tables can
More informationCOSC 6385 Computer Architecture - Memory Hierarchy Design (III)
COSC 6385 Computer Architecture - Memory Hierarchy Design (III) Fall 2006 Reducing cache miss penalty Five techniques Multilevel caches Critical word first and early restart Giving priority to read misses
More informationApplication Acceleration Beyond Flash Storage
Application Acceleration Beyond Flash Storage Session 303C Mellanox Technologies Flash Memory Summit July 2014 Accelerating Applications, Step-by-Step First Steps Make compute fast Moore s Law Make storage
More informationINTERFACING REAL-TIME AUDIO AND FILE I/O
INTERFACING REAL-TIME AUDIO AND FILE I/O Ross Bencina portaudio.com rossbencina.com rossb@audiomulch.com @RossBencina Example source code: https://github.com/rossbencina/realtimefilestreaming Q: How to
More informationPerformance of Various Levels of Storage. Movement between levels of storage hierarchy can be explicit or implicit
Memory Management All data in memory before and after processing All instructions in memory in order to execute Memory management determines what is to be in memory Memory management activities Keeping
More information4.1 Introduction 4.3 Datapath 4.4 Control 4.5 Pipeline overview 4.6 Pipeline control * 4.7 Data hazard & forwarding * 4.
Chapter 4: CPU 4.1 Introduction 4.3 Datapath 4.4 Control 4.5 Pipeline overview 4.6 Pipeline control * 4.7 Data hazard & forwarding * 4.8 Control hazard 4.14 Concluding Rem marks Hazards Situations that
More informationNaiad (Timely Dataflow) & Streaming Systems
Naiad (Timely Dataflow) & Streaming Systems CS 848: Models and Applications of Distributed Data Systems Mon, Nov 7th 2016 Amine Mhedhbi What is Timely Dataflow?! What is its significance? Dataflow?! Dataflow?!
More informationLecture: DRAM Main Memory. Topics: virtual memory wrap-up, DRAM intro and basics (Section 2.3)
Lecture: DRAM Main Memory Topics: virtual memory wrap-up, DRAM intro and basics (Section 2.3) 1 TLB and Cache Is the cache indexed with virtual or physical address? To index with a physical address, we
More informationGFS: The Google File System
GFS: The Google File System Brad Karp UCL Computer Science CS GZ03 / M030 24 th October 2014 Motivating Application: Google Crawl the whole web Store it all on one big disk Process users searches on one
More informationProcessing of big data with Apache Spark
Processing of big data with Apache Spark JavaSkop 18 Aleksandar Donevski AGENDA What is Apache Spark? Spark vs Hadoop MapReduce Application Requirements Example Architecture Application Challenges 2 WHAT
More informationChapter 8: Memory- Management Strategies. Operating System Concepts 9 th Edition
Chapter 8: Memory- Management Strategies Operating System Concepts 9 th Edition Silberschatz, Galvin and Gagne 2013 Chapter 8: Memory Management Strategies Background Swapping Contiguous Memory Allocation
More informationdavidklee.net gplus.to/kleegeek linked.com/a/davidaklee
@kleegeek davidklee.net gplus.to/kleegeek linked.com/a/davidaklee Specialties / Focus Areas / Passions: Performance Tuning & Troubleshooting Virtualization Cloud Enablement Infrastructure Architecture
More informationChapter Seven. Memories: Review. Exploiting Memory Hierarchy CACHE MEMORY AND VIRTUAL MEMORY
Chapter Seven CACHE MEMORY AND VIRTUAL MEMORY 1 Memories: Review SRAM: value is stored on a pair of inverting gates very fast but takes up more space than DRAM (4 to 6 transistors) DRAM: value is stored
More informationIntroduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1
Introduction to parallel computers and parallel programming Introduction to parallel computersand parallel programming p. 1 Content A quick overview of morden parallel hardware Parallelism within a chip
More informationChapter 7: Main Memory. Operating System Concepts Essentials 8 th Edition
Chapter 7: Main Memory Operating System Concepts Essentials 8 th Edition Silberschatz, Galvin and Gagne 2011 Chapter 7: Memory Management Background Swapping Contiguous Memory Allocation Paging Structure
More informationChapter 13: I/O Systems
Chapter 13: I/O Systems I/O Hardware Application I/O Interface Kernel I/O Subsystem Transforming I/O Requests to Hardware Operations Streams Performance Objectives Explore the structure of an operating
More informationPersistent RNNs. (stashing recurrent weights on-chip) Gregory Diamos. April 7, Baidu SVAIL
(stashing recurrent weights on-chip) Baidu SVAIL April 7, 2016 SVAIL Think hard AI. Goal Develop hard AI technologies that impact 100 million users. Deep Learning at SVAIL 100 GFLOP/s 1 laptop 6 TFLOP/s
More informationCache Performance and Memory Management: From Absolute Addresses to Demand Paging. Cache Performance
6.823, L11--1 Cache Performance and Memory Management: From Absolute Addresses to Demand Paging Asanovic Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 Cache Performance 6.823,
More informationOptimizing for DirectX Graphics. Richard Huddy European Developer Relations Manager
Optimizing for DirectX Graphics Richard Huddy European Developer Relations Manager Also on today from ATI... Start & End Time: 12:00pm 1:00pm Title: Precomputed Radiance Transfer and Spherical Harmonic
More informationImplementing Symmetric Multiprocessing in LispWorks
Implementing Symmetric Multiprocessing in LispWorks Making a multithreaded application more multithreaded Martin Simmons, LispWorks Ltd Copyright 2009 LispWorks Ltd Outline Introduction Changes in LispWorks
More informationEmbedded Systems Dr. Santanu Chaudhury Department of Electrical Engineering Indian Institute of Technology, Delhi
Embedded Systems Dr. Santanu Chaudhury Department of Electrical Engineering Indian Institute of Technology, Delhi Lecture - 13 Virtual memory and memory management unit In the last class, we had discussed
More informationAdapted from David Patterson s slides on graduate computer architecture
Mei Yang Adapted from David Patterson s slides on graduate computer architecture Introduction Ten Advanced Optimizations of Cache Performance Memory Technology and Optimizations Virtual Memory and Virtual
More informationTutorial 8 Build resilient, responsive and scalable web applications with SocketPro
Tutorial 8 Build resilient, responsive and scalable web applications with SocketPro Contents: Introduction SocketPro ways for resilient, responsive and scalable web applications Vertical scalability o
More informationQuestion?! Processor comparison!
1! 2! Suggested Readings!! Readings!! H&P: Chapter 5.1-5.2!! (Over the next 2 lectures)! Lecture 18" Introduction to Memory Hierarchies! 3! Processor components! Multicore processors and programming! Question?!
More informationAsynchronous Parallel Learning for Neural Networks and Structured Models with Dense Features
Asynchronous Parallel Learning for Neural Networks and Structured Models with Dense Features Xu SUN ( 孙栩 ) Peking University xusun@pku.edu.cn Motivation Neural networks -> Good Performance CNN, RNN, LSTM
More informationMySQL Database Scalability
MySQL Database Scalability Nextcloud Conference 2016 TU Berlin Oli Sennhauser Senior MySQL Consultant at FromDual GmbH oli.sennhauser@fromdual.com 1 / 14 About FromDual GmbH Support Consulting remote-dba
More informationCOT 4600 Operating Systems Fall Dan C. Marinescu Office: HEC 439 B Office hours: Tu-Th 3:00-4:00 PM
COT 4600 Operating Systems Fall 2009 Dan C. Marinescu Office: HEC 439 B Office hours: Tu-Th 3:00-4:00 PM Lecture 23 Attention: project phase 4 due Tuesday November 24 Final exam Thursday December 10 4-6:50
More informationA Predictable RTOS. Mantis Cheng Department of Computer Science University of Victoria
A Predictable RTOS Mantis Cheng Department of Computer Science University of Victoria Outline I. Analysis of Timeliness Requirements II. Analysis of IO Requirements III. Time in Scheduling IV. IO in Scheduling
More informationModule 12: I/O Systems
Module 12: I/O Systems I/O hardwared Application I/O Interface Kernel I/O Subsystem Transforming I/O Requests to Hardware Operations Performance 12.1 I/O Hardware Incredible variety of I/O devices Common
More informationChapter 13: I/O Systems
COP 4610: Introduction to Operating Systems (Spring 2015) Chapter 13: I/O Systems Zhi Wang Florida State University Content I/O hardware Application I/O interface Kernel I/O subsystem I/O performance Objectives
More informationAC: COMPOSABLE ASYNCHRONOUS IO FOR NATIVE LANGUAGES. Tim Harris, Martín Abadi, Rebecca Isaacs & Ross McIlroy
AC: COMPOSABLE ASYNCHRONOUS IO FOR NATIVE LANGUAGES Tim Harris, Martín Abadi, Rebecca Isaacs & Ross McIlroy Synchronous IO in the Windows API Read the contents of h, and compute a result BOOL ProcessFile(HANDLE
More informationTopics. File Buffer Cache for Performance. What to Cache? COS 318: Operating Systems. File Performance and Reliability
Topics COS 318: Operating Systems File Performance and Reliability File buffer cache Disk failure and recovery tools Consistent updates Transactions and logging 2 File Buffer Cache for Performance What
More informationMessaging Overview. Introduction. Gen-Z Messaging
Page 1 of 6 Messaging Overview Introduction Gen-Z is a new data access technology that not only enhances memory and data storage solutions, but also provides a framework for both optimized and traditional
More informationThe C4 Collector. Or: the Application memory wall will remain until compaction is solved. Gil Tene Balaji Iyengar Michael Wolf
The C4 Collector Or: the Application memory wall will remain until compaction is solved Gil Tene Balaji Iyengar Michael Wolf High Level Agenda 1. The Application Memory Wall 2. Generational collection
More information