I/O, today, is Remote (block) Load/Store, and must not be slower than Compute, any more

Size: px
Start display at page:

Download "I/O, today, is Remote (block) Load/Store, and must not be slower than Compute, any more"

Transcription

1 I/O, today, is Remote (block) Load/Store, and must not be slower than Compute, any more Manolis Katevenis FORTH, Heraklion, Crete, Greece (in collab. with Univ. of Crete)

2 Acknowledgements Vassilis Papaefstathiou Manolis Marazakis Nikolaos Chrysos Iakovos Mavroidis John Goodacre (ARM, KALEAO) and many many more!... 2

3 The curse of being perceived as Slow Tell a computer engineer that something is slow and s/he will design-in a myriad other reasons to ensure that it will always remain slow! Optimize for the critical path and the common case Input/Output (I/O) hardware was slow Architecture (virtualization) and Software (OS, Languages, Apps) adjusted assuming that I/O is slow I/O hardware is no longer slow but Architecture & Software Inertia ensure that it remains Shared-Memory Parallel Programs work over Memory Message-Passing Parallel Programs work over I/O 3

4 I/O, nowadays, is NVM s and fast Networks Slow peripherals only indirectly connected, to few Proc s Most hi-speed processors are in Datacenters, HPC clusters Storage becomes Non-Volatile Memories like DRAM magnetic disks indirectly, through network & controllers High-speed network thruput approaches that of DRAM Short-range (e.g. on-board) network latency has no reason to be longer than DRAM latency 4

5 Memory vs. I/O once upon a time vs. now Memory is fast Virtualized in hardware, at ns scale user process believes it owns entire address space ld/st instructions, caches, TLB ensure ns scale I/O is slow don t bother to optimize rarely virtualized at user level (=process owns I/O devices) L1-miss L2-hit (on-chip) latency on the order of 10 ns on-chip PL ( I/O ) register read in Zynq UltraScale+ FPGA on the order of 100 ns (~500 ns for PCIe) Why?? because people didn t think it was worth optimizing 5

6 Make I/O (interproc. commun.) a first-class citizen! Parallel (& distributed) systems are ubiquitous nowadays Interprocessor Communication is as important as Computation Shared Memory communication limited to ~100 s threads We need communication across hw-coherence islands ( I/O to nearby chips) to be as fast as DRAM access (which is also nearby chips ) There is no reason for a (16-64 By) message send-receive to be any slower than a cache-line invalidate to a nearby socket followed by a cache-line miss from that socket 6

7 Inefficiencies in Cluster Communication, to be overcomed Sender Memory user ideal (zero copy) RDMA Receiver Memory user cp DMA Sender NIC Network Receiver NIC DMA cp kernel or lib kernel or lib Five (5) copyings of the data, instead of just one (1)! Data copying consumes time and energy Start by eliminating (large) NIC buffers: DMA across the network RDMA 7

8 DMA Initiation: System Call, or directly from User-Level? Sender user Mem user process virtual addresses ideal (zero copy) RDMA Receiver Memory user pinned buf cp kernel or lib kernel dma rd phy. addr. DMA vdma NI Network dma write ack NI ack cp pinned kernel or lib System Call to initiate a DMA (with physical addr.) 3 to 4 μs on 64-bit ARM in Xilinx Zynq UltraScale+ under Linux For the virtualized DMA engine (8 channels), that accepts virtual address arguments, Initiation (writing 3-4 registers) <0.5 μs Next Questions: Who translates addresses & where? Why copy User to/from Kernel/Library? 8

9 Scalability: Global Virtual Addresses & Progressive Translation Source Memory Hier archy read Translate (MMU) Copy (RDMA) Engine 64 bit virtual addresses src addr dst addr PID Network Global Virtual Address Space Routing Translate (MMU) write Desti nation Memory Hier archy Data for Scalability: not all nodes can be required to know all other nodes translation tables see Progressive Address Translation in the M. Katevenis paper in: Stamatis Vassiliadis Symposium 2007 Destination Addresses in network packets must be Virtual for Scalability: 64-bit GVA + (global) protection domain identifier 9

10 Current Address Translation Architectures not ready for GVAS read or write addresses, virtual or physical, but truncated to: Processor Cores Caches local DRAM vdma engine SMMU physical only addresses 44 bits 40 bits translate if virtual route based on phy. address 448 GBy window to other PS peripheral devices Address translation architecture of A53 in Xilinx Zynq UltraScale+ DMA engine is virtualized, but truncates virtual addresses to <64 b no mechanism to send the untranslated VA to the network (remote) load/store instructions suffer compulsory addr. translation Programming System PS Programmable Logic PL FPGA 10

11 Mbox Remote Enqueue: Asynchronous Event Multiplexing P1 P2 single word messages sent via single store store instruction Multiple Senders memory 2 m2 m1... then sent memory 3 Remote Queue m1m2 "Mbox" Shared Space longer messages: first composed locally... wait on empty The basic synchronization mechanism for parallel programs, I believe Event Multiplexor requests, notifications (like Select system call) P3 Single Receiver Atomic enqueue into shared space, unlike dedicated space with RDMA Space reserved only for number of actual senders not potential senders 11

12 Store-multiple instruction for atomic-packet send (remenq)? P1 Multiple Senders P2 single word messages sent via single store store instruction m1 store multiple: atomic packet?? m2... then sent m2 memory 3 Remote Queue m1m2 "Mbox" Shared Space longer messages alternative: first composed locally... Need a fast way to send a single, guaranteed-atomic network packet Store-multiple (registers) instruction exists on ARMv8, but wait on empty its current imlementation does not guarantee that all data words will be transferred in a single AXI burst to the I/O hardware P3 Single Receiver 12

13 Per-Task Mbox Q s & Scheduler as Interrupt Generalization High rity now arriving Prio Low Event Queues Q21 Q01 Q11 Processes / Tasks T21 T11 T01 Interrupt! Run Scheduler P core low power sleep when all Queues Empty switch task per time quantum Tasks block and wait when synchronously reading from an empty queue Scheduler selects a non-empty queue of highest priority and runs its task Time quantum switches Q s and Tasks inside top non-empty priority class Enqueues into higher priority Q s interrupt lower-priority running tasks RDMA is the Datapath Queues are the Control 13

14 Why copy data between User and Kernel/Library buffers? Sender user Mem user process virtual addresses ideal (zero copy) RDMA Receiver Memory user pinned buf cp kernel or lib kernel dma rd phy. addr. DMA vdma NI Network dma write ack NI ack cp pinned kernel or lib Pinned buffers: traditional DMA does not tolerate page faults Registered buffer addresses, in the lack of System (I/O) MMU Non-cacheable buffers, in the lack of cache-coherent DMA Send buffer reuse immediately after send initiation Transfer occurs before receive-buffer ready, and receive Application unwilling to accept reception at library-specified address 14

15 MPI: Send-Rcv Match at Receiver, User-Specified Rcv Addr Producer (user process): write into sb (early) receive ready to send, ready to at which address? send receive, but in my buffer, pointer to rb please! match sb rb done Consumer (user process): read from rb (done) cost = one extra network round-trip time for sender to learn the address 15

16 Snd-Rcv Match at Sender, User-Specified Receive Address Producer (user process): write into sb send match pointer to rb (early) receive ready to receive, but in my buffer, please! sb rb minimal latency achieved; early receive (rb pre-alloc by user) still needed does not work with MPI_ANY_SOURCE done Consumer (user process): read from rb (done) 16

17 Receiver willing to accept Data at any Address (and not copy) Producer (user process): write into sb send sb "allocate buffer, and... here come the data" or "use rb[i] that had been preallocated for me rb tell me when and where my data will be ready rcv match done Consumer (user process): read from rb (done) Eager delivery at preallocated, library-managed buffer, then given to user 17

18 Conclusions I/O is Interprocessor Communication and it should be made as fast as Compute Short messages: (remote) load/store-multiple instructions Long messages: virtualized DMA engine no system calls Global Virtual Address Space (GVAS) RDMA is the Datapath Queues (mbox) are the Control 18

19 Back-up Slides 19

20 Two Slides from my Stamatis Vassiliadis 2007 Symposium talk: Network Routing as Generalization of Address Decoding 1x: I/O 01: SRAM 1 00: SRAM 0 2 (a) P Physical Addr. SRAM 0 Data SRAM 1 I/O Brid ge Physical Address Decoding in a uniprocessor (b) P SRAM Packet body hdr 2 MS bi ts of physical desti nati on address No de 0 00: 1x: 01: No de 1 Geographical Address Routing in a multiprocessor No de 2 10: 11: No de

21 Progressive Address Translation: Localize Migration Updates P Tbl_A xx 1x xx 01 1x Tbl_B 00 0x Tbl_C 10 xx 11 0x Tbl_D x x Domain D pg9' Tbl_D pg Tbl_D3 Packets carry global virtual addresses Tables provide physical route (address) for the next few steps When page 9 migrates within D, only tables in that domain need updating Variable-size-page translation tables look like internet routing tables (longest-prefix matches if we want small-page-within-big-region migration) Tables that partition the system, for protection against untrusted operating systems, look like internet firewalls 21

Cluster Communica/on Latency:

Cluster Communica/on Latency: Cluster Communica/on Latency: towards approaching its Minimum Hardware Limits, on Low-Power Pla=orms Manolis Katevenis FORTH, Heraklion, Crete, Greece (in collab. with Univ. of Crete) http://www.ics.forth.gr/carv/

More information

Cluster Communica/on Latency:

Cluster Communica/on Latency: Cluster Communica/on Latency: towards approaching its Minimum Hardware Limits, on Low-Power Pla=orms Manolis Katevenis FORTH, Heraklion, Crete, Greece (in collab. with Univ. of Crete) http://www.ics.forth.gr/carv/

More information

I/O Handling. ECE 650 Systems Programming & Engineering Duke University, Spring Based on Operating Systems Concepts, Silberschatz Chapter 13

I/O Handling. ECE 650 Systems Programming & Engineering Duke University, Spring Based on Operating Systems Concepts, Silberschatz Chapter 13 I/O Handling ECE 650 Systems Programming & Engineering Duke University, Spring 2018 Based on Operating Systems Concepts, Silberschatz Chapter 13 Input/Output (I/O) Typical application flow consists of

More information

Network Interface Architecture and Prototyping for Chip and Cluster Multiprocessors

Network Interface Architecture and Prototyping for Chip and Cluster Multiprocessors University of Crete School of Sciences & Engineering Computer Science Department Master Thesis by Michael Papamichael Network Interface Architecture and Prototyping for Chip and Cluster Multiprocessors

More information

Review: Hardware user/kernel boundary

Review: Hardware user/kernel boundary Review: Hardware user/kernel boundary applic. applic. applic. user lib lib lib kernel syscall pg fault syscall FS VM sockets disk disk NIC context switch TCP retransmits,... device interrupts Processor

More information

6.9. Communicating to the Outside World: Cluster Networking

6.9. Communicating to the Outside World: Cluster Networking 6.9 Communicating to the Outside World: Cluster Networking This online section describes the networking hardware and software used to connect the nodes of cluster together. As there are whole books and

More information

Moneta: A High-performance Storage Array Architecture for Nextgeneration, Micro 2010

Moneta: A High-performance Storage Array Architecture for Nextgeneration, Micro 2010 Moneta: A High-performance Storage Array Architecture for Nextgeneration, Non-volatile Memories Micro 2010 NVM-based SSD NVMs are replacing spinning-disks Performance of disks has lagged NAND flash showed

More information

CS370 Operating Systems

CS370 Operating Systems CS370 Operating Systems Colorado State University Yashwant K Malaiya Spring 2018 Lecture 2 Slides based on Text by Silberschatz, Galvin, Gagne Various sources 1 1 2 What is an Operating System? What is

More information

CS5460: Operating Systems

CS5460: Operating Systems CS5460: Operating Systems Lecture 2: OS Hardware Interface (Chapter 2) Course web page: http://www.eng.utah.edu/~cs5460/ CADE lab: WEB L224 and L226 http://www.cade.utah.edu/ Projects will be on Linux

More information

FlexNIC: Rethinking Network DMA

FlexNIC: Rethinking Network DMA FlexNIC: Rethinking Network DMA Antoine Kaufmann Simon Peter Tom Anderson Arvind Krishnamurthy University of Washington HotOS 2015 Networks: Fast and Growing Faster 1 T 400 GbE Ethernet Bandwidth [bits/s]

More information

Topics: Memory Management (SGG, Chapter 08) 8.1, 8.2, 8.3, 8.5, 8.6 CS 3733 Operating Systems

Topics: Memory Management (SGG, Chapter 08) 8.1, 8.2, 8.3, 8.5, 8.6 CS 3733 Operating Systems Topics: Memory Management (SGG, Chapter 08) 8.1, 8.2, 8.3, 8.5, 8.6 CS 3733 Operating Systems Instructor: Dr. Turgay Korkmaz Department Computer Science The University of Texas at San Antonio Office: NPB

More information

CS350: Final Exam Review

CS350: Final Exam Review University of Waterloo CS350: Final Exam Review Gwynneth Leece, Andrew Song, Rebecca Putinski Winter, 2010 Intro, Threads & Concurrency What are the three views of an operating system? Describe them. Define

More information

Memory Hierarchy. Mehran Rezaei

Memory Hierarchy. Mehran Rezaei Memory Hierarchy Mehran Rezaei What types of memory do we have? Registers Cache (Static RAM) Main Memory (Dynamic RAM) Disk (Magnetic Disk) Option : Build It Out of Fast SRAM About 5- ns access Decoders

More information

Hardware OS & OS- Application interface

Hardware OS & OS- Application interface CS 4410 Operating Systems Hardware OS & OS- Application interface Summer 2013 Cornell University 1 Today How my device becomes useful for the user? HW-OS interface Device controller Device driver Interrupts

More information

Operating system Dr. Shroouq J.

Operating system Dr. Shroouq J. 2.2.2 DMA Structure In a simple terminal-input driver, when a line is to be read from the terminal, the first character typed is sent to the computer. When that character is received, the asynchronous-communication

More information

Input/Output. Today. Next. Principles of I/O hardware & software I/O software layers Disks. Protection & Security

Input/Output. Today. Next. Principles of I/O hardware & software I/O software layers Disks. Protection & Security Input/Output Today Principles of I/O hardware & software I/O software layers Disks Next Protection & Security Operating Systems and I/O Two key operating system goals Control I/O devices Provide a simple,

More information

Spring 2017 :: CSE 506. Device Programming. Nima Honarmand

Spring 2017 :: CSE 506. Device Programming. Nima Honarmand Device Programming Nima Honarmand read/write interrupt read/write Spring 2017 :: CSE 506 Device Interface (Logical View) Device Interface Components: Device registers Device Memory DMA buffers Interrupt

More information

Recall: Paging. Recall: Paging. Recall: Paging. CS162 Operating Systems and Systems Programming Lecture 13. Address Translation, Caching

Recall: Paging. Recall: Paging. Recall: Paging. CS162 Operating Systems and Systems Programming Lecture 13. Address Translation, Caching CS162 Operating Systems and Systems Programming Lecture 13 Address Translation, Caching March 7 th, 218 Profs. Anthony D. Joseph & Jonathan Ragan-Kelley http//cs162.eecs.berkeley.edu Recall Paging Page

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Processor-Memory Performance Gap 10000 µproc 55%/year (2X/1.5yr) Performance 1000 100 10 1 1980 1983 1986 1989 Moore s Law Processor-Memory Performance

More information

CSE 431 Computer Architecture Fall Chapter 5A: Exploiting the Memory Hierarchy, Part 1

CSE 431 Computer Architecture Fall Chapter 5A: Exploiting the Memory Hierarchy, Part 1 CSE 431 Computer Architecture Fall 2008 Chapter 5A: Exploiting the Memory Hierarchy, Part 1 Mary Jane Irwin ( www.cse.psu.edu/~mji ) [Adapted from Computer Organization and Design, 4 th Edition, Patterson

More information

Chapter 5 Memory Hierarchy Design. In-Cheol Park Dept. of EE, KAIST

Chapter 5 Memory Hierarchy Design. In-Cheol Park Dept. of EE, KAIST Chapter 5 Memory Hierarchy Design In-Cheol Park Dept. of EE, KAIST Why cache? Microprocessor performance increment: 55% per year Memory performance increment: 7% per year Principles of locality Spatial

More information

Last class: Today: Course administration OS definition, some history. Background on Computer Architecture

Last class: Today: Course administration OS definition, some history. Background on Computer Architecture 1 Last class: Course administration OS definition, some history Today: Background on Computer Architecture 2 Canonical System Hardware CPU: Processor to perform computations Memory: Programs and data I/O

More information

CS370 Operating Systems

CS370 Operating Systems CS370 Operating Systems Colorado State University Yashwant K Malaiya Fall 2016 Lecture 2 Slides based on Text by Silberschatz, Galvin, Gagne Various sources 1 1 2 System I/O System I/O (Chap 13) Central

More information

Department of Computer Science, Institute for System Architecture, Operating Systems Group. Real-Time Systems '08 / '09. Hardware.

Department of Computer Science, Institute for System Architecture, Operating Systems Group. Real-Time Systems '08 / '09. Hardware. Department of Computer Science, Institute for System Architecture, Operating Systems Group Real-Time Systems '08 / '09 Hardware Marcus Völp Outlook Hardware is Source of Unpredictability Caches Pipeline

More information

CSC501 Operating Systems Principles. OS Structure

CSC501 Operating Systems Principles. OS Structure CSC501 Operating Systems Principles OS Structure 1 Announcements q TA s office hour has changed Q Thursday 1:30pm 3:00pm, MRC-409C Q Or email: awang@ncsu.edu q From department: No audit allowed 2 Last

More information

Where We Are in This Course Right Now. ECE 152 Introduction to Computer Architecture Input/Output (I/O) Copyright 2012 Daniel J. Sorin Duke University

Where We Are in This Course Right Now. ECE 152 Introduction to Computer Architecture Input/Output (I/O) Copyright 2012 Daniel J. Sorin Duke University Introduction to Computer Architecture Input/Output () Copyright 2012 Daniel J. Sorin Duke University Slides are derived from work by Amir Roth (Penn) Spring 2012 Where We Are in This Course Right Now So

More information

Performance Evaluation of Myrinet-based Network Router

Performance Evaluation of Myrinet-based Network Router Performance Evaluation of Myrinet-based Network Router Information and Communications University 2001. 1. 16 Chansu Yu, Younghee Lee, Ben Lee Contents Suez : Cluster-based Router Suez Implementation Implementation

More information

Architecture and OS. To do. q Architecture impact on OS q OS impact on architecture q Next time: OS components and structure

Architecture and OS. To do. q Architecture impact on OS q OS impact on architecture q Next time: OS components and structure Architecture and OS To do q Architecture impact on OS q OS impact on architecture q Next time: OS components and structure Computer architecture and OS OS is intimately tied to the hardware it runs on

More information

ELEC 5200/6200 Computer Architecture and Design Spring 2017 Lecture 7: Memory Organization Part II

ELEC 5200/6200 Computer Architecture and Design Spring 2017 Lecture 7: Memory Organization Part II ELEC 5200/6200 Computer Architecture and Design Spring 2017 Lecture 7: Organization Part II Ujjwal Guin, Assistant Professor Department of Electrical and Computer Engineering Auburn University, Auburn,

More information

Operating Systems. Introduction & Overview. Outline for today s lecture. Administrivia. ITS 225: Operating Systems. Lecture 1

Operating Systems. Introduction & Overview. Outline for today s lecture. Administrivia. ITS 225: Operating Systems. Lecture 1 ITS 225: Operating Systems Operating Systems Lecture 1 Introduction & Overview Jan 15, 2004 Dr. Matthew Dailey Information Technology Program Sirindhorn International Institute of Technology Thammasat

More information

Prototyping a Configurable Cache/Scratchpad Memory with Virtualized User-Level RDMA Capability

Prototyping a Configurable Cache/Scratchpad Memory with Virtualized User-Level RDMA Capability Prototyping a Configurable Cache/Scratchpad Memory with Virtualized User-Level RDMA Capability George Kalokerinos, Vassilis Papaefstathiou, George Nikiforos, Stamatis Kavadias, Manolis Katevenis, Dionisios

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic disk 5ms 20ms, $0.20 $2 per

More information

CS 3733 Operating Systems:

CS 3733 Operating Systems: CS 3733 Operating Systems: Topics: Memory Management (SGG, Chapter 08) Instructor: Dr Dakai Zhu Department of Computer Science @ UTSA 1 Reminders Assignment 2: extended to Monday (March 5th) midnight:

More information

Chapter-6. SUBJECT:- Operating System TOPICS:- I/O Management. Created by : - Sanjay Patel

Chapter-6. SUBJECT:- Operating System TOPICS:- I/O Management. Created by : - Sanjay Patel Chapter-6 SUBJECT:- Operating System TOPICS:- I/O Management Created by : - Sanjay Patel Disk Scheduling Algorithm 1) First-In-First-Out (FIFO) 2) Shortest Service Time First (SSTF) 3) SCAN 4) Circular-SCAN

More information

Donn Morrison Department of Computer Science. TDT4255 Memory hierarchies

Donn Morrison Department of Computer Science. TDT4255 Memory hierarchies TDT4255 Lecture 10: Memory hierarchies Donn Morrison Department of Computer Science 2 Outline Chapter 5 - Memory hierarchies (5.1-5.5) Temporal and spacial locality Hits and misses Direct-mapped, set associative,

More information

CS162 Operating Systems and Systems Programming Lecture 14. Caching and Demand Paging

CS162 Operating Systems and Systems Programming Lecture 14. Caching and Demand Paging CS162 Operating Systems and Systems Programming Lecture 14 Caching and Demand Paging October 17, 2007 Prof. John Kubiatowicz http://inst.eecs.berkeley.edu/~cs162 Review: Hierarchy of a Modern Computer

More information

Outline. Limited Scaling of a Bus

Outline. Limited Scaling of a Bus Outline Scalability physical, bandwidth, latency and cost level of integration Realizing rogramming Models network transactions protocols safety input buffer problem: N-1 fetch deadlock Communication Architecture

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Processor-Memory Performance Gap 10000 µproc 55%/year (2X/1.5yr) Performance 1000 100 10 1 1980 1983 1986 1989 Moore s Law Processor-Memory Performance

More information

Lessons learned from MPI

Lessons learned from MPI Lessons learned from MPI Patrick Geoffray Opinionated Senior Software Architect patrick@myri.com 1 GM design Written by hardware people, pre-date MPI. 2-sided and 1-sided operations: All asynchronous.

More information

Near Memory Key/Value Lookup Acceleration MemSys 2017

Near Memory Key/Value Lookup Acceleration MemSys 2017 Near Key/Value Lookup Acceleration MemSys 2017 October 3, 2017 Scott Lloyd, Maya Gokhale Center for Applied Scientific Computing This work was performed under the auspices of the U.S. Department of Energy

More information

Caches and Memory Hierarchy: Review. UCSB CS240A, Winter 2016

Caches and Memory Hierarchy: Review. UCSB CS240A, Winter 2016 Caches and Memory Hierarchy: Review UCSB CS240A, Winter 2016 1 Motivation Most applications in a single processor runs at only 10-20% of the processor peak Most of the single processor performance loss

More information

Spring 2016 :: CSE 502 Computer Architecture. Caches. Nima Honarmand

Spring 2016 :: CSE 502 Computer Architecture. Caches. Nima Honarmand Caches Nima Honarmand Motivation 10000 Performance 1000 100 10 Processor Memory 1 1985 1990 1995 2000 2005 2010 Want memory to appear: As fast as CPU As large as required by all of the running applications

More information

1. Memory technology & Hierarchy

1. Memory technology & Hierarchy 1 Memory technology & Hierarchy Caching and Virtual Memory Parallel System Architectures Andy D Pimentel Caches and their design cf Henessy & Patterson, Chap 5 Caching - summary Caches are small fast memories

More information

CSE502: Computer Architecture CSE 502: Computer Architecture

CSE502: Computer Architecture CSE 502: Computer Architecture CSE 502: Computer Architecture Memory Hierarchy & Caches Motivation 10000 Performance 1000 100 10 Processor Memory 1 1985 1990 1995 2000 2005 2010 Want memory to appear: As fast as CPU As large as required

More information

Flexible Architecture Research Machine (FARM)

Flexible Architecture Research Machine (FARM) Flexible Architecture Research Machine (FARM) RAMP Retreat June 25, 2009 Jared Casper, Tayo Oguntebi, Sungpack Hong, Nathan Bronson Christos Kozyrakis, Kunle Olukotun Motivation Why CPUs + FPGAs make sense

More information

A Disseminated Distributed OS for Hardware Resource Disaggregation Yizhou Shan

A Disseminated Distributed OS for Hardware Resource Disaggregation Yizhou Shan LegoOS A Disseminated Distributed OS for Hardware Resource Disaggregation Yizhou Shan, Yutong Huang, Yilun Chen, and Yiying Zhang Y 4 1 2 Monolithic Server OS / Hypervisor 3 Problems? 4 cpu mem Resource

More information

Computer-System Organization (cont.)

Computer-System Organization (cont.) Computer-System Organization (cont.) Interrupt time line for a single process doing output. Interrupts are an important part of a computer architecture. Each computer design has its own interrupt mechanism,

More information

CS533 Concepts of Operating Systems. Jonathan Walpole

CS533 Concepts of Operating Systems. Jonathan Walpole CS533 Concepts of Operating Systems Jonathan Walpole Improving IPC by Kernel Design & The Performance of Micro- Kernel Based Systems The IPC Dilemma IPC is very import in µ-kernel design - Increases modularity,

More information

Multiprocessor Systems. Chapter 8, 8.1

Multiprocessor Systems. Chapter 8, 8.1 Multiprocessor Systems Chapter 8, 8.1 1 Learning Outcomes An understanding of the structure and limits of multiprocessor hardware. An appreciation of approaches to operating system support for multiprocessor

More information

Scalable Distributed Memory Machines

Scalable Distributed Memory Machines Scalable Distributed Memory Machines Goal: Parallel machines that can be scaled to hundreds or thousands of processors. Design Choices: Custom-designed or commodity nodes? Network scalability. Capability

More information

CS162 Operating Systems and Systems Programming Lecture 16. I/O Systems. Page 1

CS162 Operating Systems and Systems Programming Lecture 16. I/O Systems. Page 1 CS162 Operating Systems and Systems Programming Lecture 16 I/O Systems March 31, 2008 Prof. Anthony D. Joseph http://inst.eecs.berkeley.edu/~cs162 Review: Hierarchy of a Modern Computer System Take advantage

More information

virtual memory Page 1 CSE 361S Disk Disk

virtual memory Page 1 CSE 361S Disk Disk CSE 36S Motivations for Use DRAM a for the Address space of a process can exceed physical memory size Sum of address spaces of multiple processes can exceed physical memory Simplify Management 2 Multiple

More information

Fast access ===> use map to find object. HW == SW ===> map is in HW or SW or combo. Extend range ===> longer, hierarchical names

Fast access ===> use map to find object. HW == SW ===> map is in HW or SW or combo. Extend range ===> longer, hierarchical names Fast access ===> use map to find object HW == SW ===> map is in HW or SW or combo Extend range ===> longer, hierarchical names How is map embodied: --- L1? --- Memory? The Environment ---- Long Latency

More information

What Operating Systems Do An operating system is a program hardware that manages the computer provides a basis for application programs acts as an int

What Operating Systems Do An operating system is a program hardware that manages the computer provides a basis for application programs acts as an int Operating Systems Lecture 1 Introduction Agenda: What Operating Systems Do Computer System Components How to view the Operating System Computer-System Operation Interrupt Operation I/O Structure DMA Structure

More information

CSE Traditional Operating Systems deal with typical system software designed to be:

CSE Traditional Operating Systems deal with typical system software designed to be: CSE 6431 Traditional Operating Systems deal with typical system software designed to be: general purpose running on single processor machines Advanced Operating Systems are designed for either a special

More information

«Real Time Embedded systems» Multi Masters Systems

«Real Time Embedded systems» Multi Masters Systems «Real Time Embedded systems» Multi Masters Systems rene.beuchat@epfl.ch LAP/ISIM/IC/EPFL Chargé de cours rene.beuchat@hesge.ch LSN/hepia Prof. HES 1 Multi Master on Chip On a System On Chip, Master can

More information

Views of Memory. Real machines have limited amounts of memory. Programmer doesn t want to be bothered. 640KB? A few GB? (This laptop = 2GB)

Views of Memory. Real machines have limited amounts of memory. Programmer doesn t want to be bothered. 640KB? A few GB? (This laptop = 2GB) CS6290 Memory Views of Memory Real machines have limited amounts of memory 640KB? A few GB? (This laptop = 2GB) Programmer doesn t want to be bothered Do you think, oh, this computer only has 128MB so

More information

LECTURE 11. Memory Hierarchy

LECTURE 11. Memory Hierarchy LECTURE 11 Memory Hierarchy MEMORY HIERARCHY When it comes to memory, there are two universally desirable properties: Large Size: ideally, we want to never have to worry about running out of memory. Speed

More information

Flavors of Memory supported by Linux, their use and benefit. Christoph Lameter, Ph.D,

Flavors of Memory supported by Linux, their use and benefit. Christoph Lameter, Ph.D, Flavors of Memory supported by Linux, their use and benefit Christoph Lameter, Ph.D, Twitter: @qant Flavors Of Memory The term computer memory is a simple term but there are numerous nuances

More information

CS162 Operating Systems and Systems Programming Lecture 14. Caching (Finished), Demand Paging

CS162 Operating Systems and Systems Programming Lecture 14. Caching (Finished), Demand Paging CS162 Operating Systems and Systems Programming Lecture 14 Caching (Finished), Demand Paging October 11 th, 2017 Neeraja J. Yadwadkar http://cs162.eecs.berkeley.edu Recall: Caching Concept Cache: a repository

More information

CHAPTER 4 MEMORY HIERARCHIES TYPICAL MEMORY HIERARCHY TYPICAL MEMORY HIERARCHY: THE PYRAMID CACHE PERFORMANCE MEMORY HIERARCHIES CACHE DESIGN

CHAPTER 4 MEMORY HIERARCHIES TYPICAL MEMORY HIERARCHY TYPICAL MEMORY HIERARCHY: THE PYRAMID CACHE PERFORMANCE MEMORY HIERARCHIES CACHE DESIGN CHAPTER 4 TYPICAL MEMORY HIERARCHY MEMORY HIERARCHIES MEMORY HIERARCHIES CACHE DESIGN TECHNIQUES TO IMPROVE CACHE PERFORMANCE VIRTUAL MEMORY SUPPORT PRINCIPLE OF LOCALITY: A PROGRAM ACCESSES A RELATIVELY

More information

Advanced Computer Networks. End Host Optimization

Advanced Computer Networks. End Host Optimization Oriana Riva, Department of Computer Science ETH Zürich 263 3501 00 End Host Optimization Patrick Stuedi Spring Semester 2017 1 Today End-host optimizations: NUMA-aware networking Kernel-bypass Remote Direct

More information

COEN-4730 Computer Architecture Lecture 3 Review of Caches and Virtual Memory

COEN-4730 Computer Architecture Lecture 3 Review of Caches and Virtual Memory 1 COEN-4730 Computer Architecture Lecture 3 Review of Caches and Virtual Memory Cristinel Ababei Dept. of Electrical and Computer Engineering Marquette University Credits: Slides adapted from presentations

More information

Computer Systems Architecture I. CSE 560M Lecture 18 Guest Lecturer: Shakir James

Computer Systems Architecture I. CSE 560M Lecture 18 Guest Lecturer: Shakir James Computer Systems Architecture I CSE 560M Lecture 18 Guest Lecturer: Shakir James Plan for Today Announcements No class meeting on Monday, meet in project groups Project demos < 2 weeks, Nov 23 rd Questions

More information

Random-Access Memory (RAM) Systemprogrammering 2007 Föreläsning 4 Virtual Memory. Locality. The CPU-Memory Gap. Topics

Random-Access Memory (RAM) Systemprogrammering 2007 Föreläsning 4 Virtual Memory. Locality. The CPU-Memory Gap. Topics Systemprogrammering 27 Föreläsning 4 Topics The memory hierarchy Motivations for VM Address translation Accelerating translation with TLBs Random-Access (RAM) Key features RAM is packaged as a chip. Basic

More information

Page 1. Review: Address Segmentation " Review: Address Segmentation " Review: Address Segmentation "

Page 1. Review: Address Segmentation  Review: Address Segmentation  Review: Address Segmentation Review Address Segmentation " CS162 Operating Systems and Systems Programming Lecture 10 Caches and TLBs" February 23, 2011! Ion Stoica! http//inst.eecs.berkeley.edu/~cs162! 1111 0000" 1110 000" Seg #"

More information

Motivations for Virtual Memory Virtual Memory Oct. 29, Why VM Works? Motivation #1: DRAM a Cache for Disk

Motivations for Virtual Memory Virtual Memory Oct. 29, Why VM Works? Motivation #1: DRAM a Cache for Disk class8.ppt 5-23 The course that gives CMU its Zip! Virtual Oct. 29, 22 Topics Motivations for VM Address translation Accelerating translation with TLBs Motivations for Virtual Use Physical DRAM as a Cache

More information

Predictable Interrupt Management and Scheduling in the Composite Component-based System

Predictable Interrupt Management and Scheduling in the Composite Component-based System Predictable Interrupt Management and Scheduling in the Composite Component-based System Gabriel Parmer and Richard West Computer Science Department Boston University Boston, MA 02215 {gabep1, richwest}@cs.bu.edu

More information

Caches and Memory Hierarchy: Review. UCSB CS240A, Fall 2017

Caches and Memory Hierarchy: Review. UCSB CS240A, Fall 2017 Caches and Memory Hierarchy: Review UCSB CS24A, Fall 27 Motivation Most applications in a single processor runs at only - 2% of the processor peak Most of the single processor performance loss is in the

More information

Reducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip

Reducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip Reducing Hit Times Critical Influence on cycle-time or CPI Keep L1 small and simple small is always faster and can be put on chip interesting compromise is to keep the tags on chip and the block data off

More information

HIGH-PERFORMANCE NETWORKING :: USER-LEVEL NETWORKING :: REMOTE DIRECT MEMORY ACCESS

HIGH-PERFORMANCE NETWORKING :: USER-LEVEL NETWORKING :: REMOTE DIRECT MEMORY ACCESS HIGH-PERFORMANCE NETWORKING :: USER-LEVEL NETWORKING :: REMOTE DIRECT MEMORY ACCESS CS6410 Moontae Lee (Nov 20, 2014) Part 1 Overview 00 Background User-level Networking (U-Net) Remote Direct Memory Access

More information

Random-Access Memory (RAM) Systemprogrammering 2009 Föreläsning 4 Virtual Memory. Locality. The CPU-Memory Gap. Topics! The memory hierarchy

Random-Access Memory (RAM) Systemprogrammering 2009 Föreläsning 4 Virtual Memory. Locality. The CPU-Memory Gap. Topics! The memory hierarchy Systemprogrammering 29 Föreläsning 4 Topics! The memory hierarchy! Motivations for VM! Address translation! Accelerating translation with TLBs Random-Access (RAM) Key features! RAM is packaged as a chip.!

More information

CS 5523 Operating Systems: Memory Management (SGG-8)

CS 5523 Operating Systems: Memory Management (SGG-8) CS 5523 Operating Systems: Memory Management (SGG-8) Instructor: Dr Tongping Liu Thank Dr Dakai Zhu, Dr Palden Lama, and Dr Tim Richards (UMASS) for providing their slides Outline Simple memory management:

More information

Department of Computer Science Institute for System Architecture, Operating Systems Group REAL-TIME MICHAEL ROITZSCH OVERVIEW

Department of Computer Science Institute for System Architecture, Operating Systems Group REAL-TIME MICHAEL ROITZSCH OVERVIEW Department of Computer Science Institute for System Architecture, Operating Systems Group REAL-TIME MICHAEL ROITZSCH OVERVIEW 2 SO FAR talked about in-kernel building blocks: threads memory IPC drivers

More information

Virtual Memory Oct. 29, 2002

Virtual Memory Oct. 29, 2002 5-23 The course that gives CMU its Zip! Virtual Memory Oct. 29, 22 Topics Motivations for VM Address translation Accelerating translation with TLBs class9.ppt Motivations for Virtual Memory Use Physical

More information

Computer Architecture Computer Science & Engineering. Chapter 5. Memory Hierachy BK TP.HCM

Computer Architecture Computer Science & Engineering. Chapter 5. Memory Hierachy BK TP.HCM Computer Architecture Computer Science & Engineering Chapter 5 Memory Hierachy Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic

More information

Scheduling. CS 161: Lecture 4 2/9/17

Scheduling. CS 161: Lecture 4 2/9/17 Scheduling CS 161: Lecture 4 2/9/17 Where does the first process come from? The Linux Boot Process Machine turned on; BIOS runs BIOS: Basic Input/Output System Stored in flash memory on motherboard Determines

More information

ECE 598 Advanced Operating Systems Lecture 14

ECE 598 Advanced Operating Systems Lecture 14 ECE 598 Advanced Operating Systems Lecture 14 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 22 March 2018 HW#6 was due. Announcements HW#7 will be posted eventually. Project

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

A Cache Hierarchy in a Computer System

A Cache Hierarchy in a Computer System A Cache Hierarchy in a Computer System Ideally one would desire an indefinitely large memory capacity such that any particular... word would be immediately available... We are... forced to recognize the

More information

Computer Architecture and OS. EECS678 Lecture 2

Computer Architecture and OS. EECS678 Lecture 2 Computer Architecture and OS EECS678 Lecture 2 1 Recap What is an OS? An intermediary between users and hardware A program that is always running A resource manager Manage resources efficiently and fairly

More information

ProtoFlex: FPGA Accelerated Full System MP Simulation

ProtoFlex: FPGA Accelerated Full System MP Simulation ProtoFlex: FPGA Accelerated Full System MP Simulation Eric S. Chung, Eriko Nurvitadhi, James C. Hoe, Babak Falsafi, Ken Mai Computer Architecture Lab at Our work in this area has been supported in part

More information

ECE 550D Fundamentals of Computer Systems and Engineering. Fall 2017

ECE 550D Fundamentals of Computer Systems and Engineering. Fall 2017 ECE 550D Fundamentals of Computer Systems and Engineering Fall 2017 The Operating System (OS) Prof. John Board Duke University Slides are derived from work by Profs. Tyler Bletsch and Andrew Hilton (Duke)

More information

Even coarse architectural trends impact tremendously the design of systems

Even coarse architectural trends impact tremendously the design of systems CSE 451: Operating Systems Spring 2006 Module 2 Architectural Support for Operating Systems John Zahorjan zahorjan@cs.washington.edu 534 Allen Center Even coarse architectural trends impact tremendously

More information

Page 1. Goals for Today" TLB organization" CS162 Operating Systems and Systems Programming Lecture 11. Page Allocation and Replacement"

Page 1. Goals for Today TLB organization CS162 Operating Systems and Systems Programming Lecture 11. Page Allocation and Replacement Goals for Today" CS162 Operating Systems and Systems Programming Lecture 11 Page Allocation and Replacement" Finish discussion on TLBs! Page Replacement Policies! FIFO, LRU! Clock Algorithm!! Working Set/Thrashing!

More information

Emulation 2. G. Lettieri. 15 Oct. 2014

Emulation 2. G. Lettieri. 15 Oct. 2014 Emulation 2 G. Lettieri 15 Oct. 2014 1 I/O examples In an emulator, we implement each I/O device as an object, or a set of objects. The real device in the target system is connected to the CPU via an interface

More information

Fast access ===> use map to find object. HW == SW ===> map is in HW or SW or combo. Extend range ===> longer, hierarchical names

Fast access ===> use map to find object. HW == SW ===> map is in HW or SW or combo. Extend range ===> longer, hierarchical names Fast access ===> use map to find object HW == SW ===> map is in HW or SW or combo Extend range ===> longer, hierarchical names How is map embodied: --- L1? --- Memory? The Environment ---- Long Latency

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic disk 5ms 20ms, $0.20 $2 per

More information

Low-Latency Datacenters. John Ousterhout Platform Lab Retreat May 29, 2015

Low-Latency Datacenters. John Ousterhout Platform Lab Retreat May 29, 2015 Low-Latency Datacenters John Ousterhout Platform Lab Retreat May 29, 2015 Datacenters: Scale and Latency Scale: 1M+ cores 1-10 PB memory 200 PB disk storage Latency: < 0.5 µs speed-of-light delay Most

More information

Memory Hierarchy. Goal: Fast, unlimited storage at a reasonable cost per bit.

Memory Hierarchy. Goal: Fast, unlimited storage at a reasonable cost per bit. Memory Hierarchy Goal: Fast, unlimited storage at a reasonable cost per bit. Recall the von Neumann bottleneck - single, relatively slow path between the CPU and main memory. Fast: When you need something

More information

COS 318: Operating Systems. Virtual Memory and Address Translation

COS 318: Operating Systems. Virtual Memory and Address Translation COS 318: Operating Systems Virtual Memory and Address Translation Andy Bavier Computer Science Department Princeton University http://www.cs.princeton.edu/courses/archive/fall10/cos318/ Today s Topics

More information

Understanding MPI on Cray XC30

Understanding MPI on Cray XC30 Understanding MPI on Cray XC30 MPICH3 and Cray MPT Cray MPI uses MPICH3 distribution from Argonne Provides a good, robust and feature rich MPI Cray provides enhancements on top of this: low level communication

More information

Memory hierarchy review. ECE 154B Dmitri Strukov

Memory hierarchy review. ECE 154B Dmitri Strukov Memory hierarchy review ECE 154B Dmitri Strukov Outline Cache motivation Cache basics Six basic optimizations Virtual memory Cache performance Opteron example Processor-DRAM gap in latency Q1. How to deal

More information

Architectural Support for Operating Systems

Architectural Support for Operating Systems Architectural Support for Operating Systems (Chapter 2) CS 4410 Operating Systems [R. Agarwal, L. Alvisi, A. Bracy, M. George, E. Sirer, R. Van Renesse] Let s start at the very beginning 2 A Short History

More information

Processes. CS 475, Spring 2018 Concurrent & Distributed Systems

Processes. CS 475, Spring 2018 Concurrent & Distributed Systems Processes CS 475, Spring 2018 Concurrent & Distributed Systems Review: Abstractions 2 Review: Concurrency & Parallelism 4 different things: T1 T2 T3 T4 Concurrency: (1 processor) Time T1 T2 T3 T4 T1 T1

More information

2. Link and Memory Architectures and Technologies

2. Link and Memory Architectures and Technologies 2. Link and Memory Architectures and Technologies 2.1 Links, Thruput/Buffering, Multi-Access Ovrhds 2.2 Memories: On-chip / Off-chip SRAM, DRAM 2.A Appendix: Elastic Buffers for Cross-Clock Commun. Manolis

More information

CPU issues address (and data for write) Memory returns data (or acknowledgment for write)

CPU issues address (and data for write) Memory returns data (or acknowledgment for write) The Main Memory Unit CPU and memory unit interface Address Data Control CPU Memory CPU issues address (and data for write) Memory returns data (or acknowledgment for write) Memories: Design Objectives

More information

Input / Output. Kevin Webb Swarthmore College April 12, 2018

Input / Output. Kevin Webb Swarthmore College April 12, 2018 Input / Output Kevin Webb Swarthmore College April 12, 2018 xkcd #927 Fortunately, the charging one has been solved now that we've all standardized on mini-usb. Or is it micro-usb? Today s Goals Characterize

More information

Multifunction Networking Adapters

Multifunction Networking Adapters Ethernet s Extreme Makeover: Multifunction Networking Adapters Chuck Hudson Manager, ProLiant Networking Technology Hewlett-Packard 2004 Hewlett-Packard Development Company, L.P. The information contained

More information

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING UNIT I

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING UNIT I DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING Year and Semester : II / IV Subject Code : CS6401 Subject Name : Operating System Degree and Branch : B.E CSE UNIT I 1. Define system process 2. What is an

More information