Cluster Communica/on Latency:
|
|
- Letitia Randall
- 5 years ago
- Views:
Transcription
1 Cluster Communica/on Latency: towards approaching its Minimum Hardware Limits, on Low-Power Pla=orms Manolis Katevenis FORTH, Heraklion, Crete, Greece (in collab. with Univ. of Crete)
2 Acknowledgements Iakovos Mavroidis John Goodacre (ARM, KALEAO) Vassilis Papaefstathiou Manolis Marazakis Nikolaos Chrysos and many many more!... 2
3 The Essence of Communica/on Producer node 2 node 1 node 3 node 4 node 5 write (remote) Interconnect Expected Consumers node 6 node 7 Producer node 2 node 1 node 3 node 4 node 5 read request (remote) reply (data) Interconnect Consumer node 6 node 7 node N node... node 8 node N node... node 8 Push style, like If you know the consumers else like spam If space already exists Zero latency, when pre-sent One-way latency, when sent just-in-/me Pull style, like web browser When reader decides it needs the data and allocates space Two-way latency, star/ng when the data is needed unless able to prefetch 3
4 Remote DMA versus Cache Coherence Producer Consumer Producer Consumer P P P P P P P P store C C C C store LM LM LM LM invalidate forwarded request data NoC request forwarded invalidate Lower Cache and Directory (Multiple Banks) NoC Remote Store (word) or Remote Write DMA (block) Example push style: explicitely sending the data: 5x fewer packets ( less /me, less energy) 4
5 Can we approach this ideal, in Cluster Communica/on? Cluster Compu/ng / Datacenters / HPC: many thousands of nodes node = island of (hardware) cache coherence Can we get Cluster Communica/on to be analogously fast like store & load instruc/ons in a coherence island? + small delay due to distance, but not much more Clusters originate from mass, commodity market, where customers did not care about a few extra μs Work done in Energy-Efficient Datacenter/HPC projects 64-bit ARM processors 5
6 Inefficiencies in Cluster Communica/on, to be overcomed Sender Memory user ideal (zero copy) RDMA Receiver Memory user cp DMA Sender NIC Network Receiver NIC DMA cp kernel or lib kernel or lib Five (5) copyings of the data, instead of just one (1)! Data copying consumes /me and energy Start by elimina/ng (large) NIC buffers: DMA across the network RDMA 6
7 DMA Ini/a/on: System Call, or directly from User-Level? Sender user Mem user process virtual addresses ideal (zero copy) RDMA Receiver Memory user pinned buf cp kernel or lib kernel dma rd phy. addr. DMA vdma NI ack Network dma write NI ack cp pinned kernel or lib System Call to ini/ate a DMA (with physical addr.) 3 to 4 μs on 64-bit ARM in Xilinx Zynq UltraScale+ under Linux For the virtualized DMA engine (8 channels), that accepts virtual address arguments, Ini/a/on (wri/ng 3-4 registers) <0.5 μs Next Ques/ons: Who translates addresses & where? Why copy User to/from Kernel/Library? 7
8 Scalability: Global Virtual Addresses & Progressive Transla/on Source Memory Hier archy read Translate (MMU) Copy (RDMA) Engine 64 bit virtual addresses src addr dst addr PID Network Global Virtual Address Space Routing Translate (MMU) write Desti nation Memory Hier archy Data for Scalability: not all nodes can be required to know all other nodes transla/on tables see Progressive Address Transla/on in the M. Katevenis paper in: Stama9s Vassiliadis Symposium 2007 Des/na/on Addresses in network packets must be Virtual for Scalability: 64-bit GVA + (global) protec/on domain iden/fier 8
9 Two Slides from my Stama9s Vassiliadis 2007 Symposium talk: Network Rou/ng as Generaliza/on of Address Decoding Physical Address Decoding in a uniprocessor Geographical Address Rou/ng in a mul/processor hwp:// 9
10 Progressive Address Transla/on: Localize Migra/on Updates Packets carry global virtual addresses Tables provide physical route (address) for the next few steps When page 9 migrates within D, only tables in that domain need upda/ng Variable-size-page transla/on tables look like internet rou/ng tables (longest-prefix matches if we want small-page-within-big-region migra/on) Tables that par//on the system, for protec/on against untrusted opera/ng systems, look like internet firewalls 10
11 Scalability: Global Virtual Addresses & Progressive Transla/on Source Memory Hier archy read Translate (MMU) Copy (RDMA) Engine 64 bit virtual addresses src addr dst addr PID Network Global Virtual Address Space Routing Translate (MMU) write Desti nation Memory Hier archy Data for Scalability: not all nodes can be required to know all other nodes transla/on tables see Progressive Address Transla/on in the M. Katevenis paper in: Stama9s Vassiliadis Symposium 2007 Des/na/on Addresses in network packets must be Virtual for Scalability: 64-bit GVA + (global) protec/on domain iden/fier 11
12 Current Address Transla/on Arch. not ready for GVAS read or write addresses, virtual or physical, but truncated to: Processor Cores Caches local DRAM vdma engine SMMU physical only addresses 44 bits 40 bits translate if virtual route based on phy. address 448 GBy window to other PS peripheral devices Address transla/on architecture of A53 in Xilinx Zynq UltraScale+ DMA engine is virtualized, but truncates virtual addresses to <64 b no mechanism to send the untranslated VA to the network (remote) load/store instruc/ons suffer compulsory addr. transla/on Programming System PS Programmable Logic PL FPGA 12
13 Why copy data between User and Kernel/Library buffers? Sender user Mem user process virtual addresses ideal (zero copy) RDMA Receiver Memory user pinned buf cp kernel or lib kernel dma rd phy. addr. DMA vdma NI ack Network dma write NI ack cp pinned kernel or lib Pinned buffers: tradi/onal DMA does not tolerate page faults Registered buffer addresses, in the lack of System (I/O) MMU Non-cacheable buffers, in the lack of cache-coherent DMA Send buffer reuse immediately ayer send ini/a/on Transfer occurs before receive-buffer ready, and receive Applica/on unwilling to accept recep/on at library-specified address 13
14 MPI: Send-Rcv Match at Receiver, User-Specified Rcv Addr Producer (user process): write into sb (early) receive ready to send, ready to at which address? send receive, but in my buffer, pointer to rb please! match sb rb done Consumer (user process): read from rb (done) cost = one extra network round-trip /me for sender to learn the address 14
15 Snd-Rcv Match at Sender, User-Specified Receive Address Producer (user process): write into sb send match pointer to rb (early) receive ready to receive, but in my buffer, please! sb rb minimal latency achieved; early receive (rb pre-alloc by user) s/ll needed does not work with MPI_ANY_SOURCE done Consumer (user process): read from rb (done) 15
16 Receiver willing to accept Data at any Address (and not copy) Producer (user process): write into sb send sb "allocate buffer, and... here come the data" or "use rb[i] that had been preallocated for me rb tell me when and where my data will be ready rcv match done Consumer (user process): read from rb (done) Eager delivery at preallocated, library-managed buffer, then given to user 16
17 Mbox Remote Enqueue: Asynchronous Event Mul/plexing P1 Multiple Senders P2 single word messages sent via single store store instruction memory 2 m2 m1... then sent memory 3 Remote Queue m1m2 "Mbox" Shared Space longer messages: first composed locally... wait on empty Atomic enqueue into shared space, unlike dedicated space with RDMA Space reserved only for number of actual senders not poten/al senders Event Mul/plexor requests, no/fica/ons (like Select system call) P3 Single Receiver RDMA comple/on should enqueue no/fica/on to receiver and sender 17
18 Per-Task Mbox Q s & Scheduler as Interrupt Generaliza/on High rity now arriving Prio Low Event Queues Q21 Q01 Q11 Processes / Tasks T21 T11 T01 Interrupt! Run Scheduler P core low power sleep when all Queues Empty switch task per time quantum Tasks block and wait when synchronously reading from an empty queue Scheduler selects a non-empty queue of highest priority and runs its task Time quantum switches Q s and Tasks inside top non-empty priority class Enqueues into higher priority Q s interrupt lower-priority running tasks RDMA is the Datapath Queues are the Control 18
19 Conclusions Interprocessor Commun. analogous to load/store instr. No system calls virtualized DMA engines Need a new Address Transla/on Architecture Global Virtual Address Space (GVAS) No/fica/ons, Mboxes, Scheduler: generalized interrupts Adapt libraries & API s to user-level, zero-copy commun. 19
Cluster Communica/on Latency:
Cluster Communica/on Latency: towards approaching its Minimum Hardware Limits, on Low-Power Pla=orms Manolis Katevenis FORTH, Heraklion, Crete, Greece (in collab. with Univ. of Crete) http://www.ics.forth.gr/carv/
More informationI/O, today, is Remote (block) Load/Store, and must not be slower than Compute, any more
I/O, today, is Remote (block) Load/Store, and must not be slower than Compute, any more Manolis Katevenis FORTH, Heraklion, Crete, Greece (in collab. with Univ. of Crete) http://www.ics.forth.gr/carv/
More informationReplicate and Migrate Objects in the Run5me not Cache Lines or Pages in Hardware
Replicate and Migrate Objects in the Run5me not Cache Lines or Pages in Hardware Manolis Katevenis FORTH ICS and Univ. of Crete, Greece BMW October 2010 FORTH Acknowledgements Alex Ramirez Dimitris Nikolopoulos
More informationReplicate and Migrate Objects in the Run5me not Cache Lines or Pages in Hardware
Replicate and Migrate Objects in the Run5me not Cache Lines or Pages in Hardware Manolis Katevenis FORTH ICS and Univ. of Crete, Greece BMW October 2010 FORTH Acknowledgements Alex Ramirez Dimitris Nikolopoulos
More informationNetwork Interface Architecture and Prototyping for Chip and Cluster Multiprocessors
University of Crete School of Sciences & Engineering Computer Science Department Master Thesis by Michael Papamichael Network Interface Architecture and Prototyping for Chip and Cluster Multiprocessors
More informationI/O Handling. ECE 650 Systems Programming & Engineering Duke University, Spring Based on Operating Systems Concepts, Silberschatz Chapter 13
I/O Handling ECE 650 Systems Programming & Engineering Duke University, Spring 2018 Based on Operating Systems Concepts, Silberschatz Chapter 13 Input/Output (I/O) Typical application flow consists of
More information6.9. Communicating to the Outside World: Cluster Networking
6.9 Communicating to the Outside World: Cluster Networking This online section describes the networking hardware and software used to connect the nodes of cluster together. As there are whole books and
More informationSlides on cross- domain call and Remote Procedure Call (RPC)
Slides on cross- domain call and Remote Procedure Call (RPC) This classic paper is a good example of a microbenchmarking study. It also explains the RPC abstraction and serves as a case study of the nuts-and-bolts
More informationOpera&ng Systems: Principles and Prac&ce. Tom Anderson
Opera&ng Systems: Principles and Prac&ce Tom Anderson How This Course Fits in the UW CSE Curriculum CSE 333: Systems Programming Project experience in C/C++ How to use the opera&ng system interface CSE
More informationNo compromises: distributed transac2ons with consistency, availability, and performance
No compromises: distributed transac2ons with consistency, availability, and performance Aleksandar Dragojevic, Dushyanth Narayanan, Edmund B. Nigh2ngale, MaDhew Renzelmann, Alex Shamis, Anirudh Badam,
More informationCaching and Demand- Paged Virtual Memory
Caching and Demand- Paged Virtual Memory Defini8ons Cache Copy of data that is faster to access than the original Hit: if cache has copy Miss: if cache does not have copy Cache block Unit of cache storage
More informationThe Kernel Abstrac/on
The Kernel Abstrac/on Debugging as Engineering Much of your /me in this course will be spent debugging In industry, 50% of sobware dev is debugging Even more for kernel development How do you reduce /me
More informationReview: Hardware user/kernel boundary
Review: Hardware user/kernel boundary applic. applic. applic. user lib lib lib kernel syscall pg fault syscall FS VM sockets disk disk NIC context switch TCP retransmits,... device interrupts Processor
More informationLessons learned from MPI
Lessons learned from MPI Patrick Geoffray Opinionated Senior Software Architect patrick@myri.com 1 GM design Written by hardware people, pre-date MPI. 2-sided and 1-sided operations: All asynchronous.
More informationPerformance Evaluation of Myrinet-based Network Router
Performance Evaluation of Myrinet-based Network Router Information and Communications University 2001. 1. 16 Chansu Yu, Younghee Lee, Ben Lee Contents Suez : Cluster-based Router Suez Implementation Implementation
More informationBackground. IBM sold expensive mainframes to large organiza<ons. Monitor sits between one or more OSes and HW
Virtual Machines Background IBM sold expensive mainframes to large organiza
More informationAn Implementation of the Homa Transport Protocol in RAMCloud. Yilong Li, Behnam Montazeri, John Ousterhout
An Implementation of the Homa Transport Protocol in RAMCloud Yilong Li, Behnam Montazeri, John Ousterhout Introduction Homa: receiver-driven low-latency transport protocol using network priorities HomaTransport
More informationFlexNIC: Rethinking Network DMA
FlexNIC: Rethinking Network DMA Antoine Kaufmann Simon Peter Tom Anderson Arvind Krishnamurthy University of Washington HotOS 2015 Networks: Fast and Growing Faster 1 T 400 GbE Ethernet Bandwidth [bits/s]
More informationBackground. IBM sold expensive mainframes to large organiza<ons. Monitor sits between one or more OSes and HW
Virtual Machines Background IBM sold expensive mainframes to large organiza
More informationScalable Multiprocessors
arallel Computer Organization and Design : Lecture 7 er Stenström. 2008, Sally A. ckee 2009 Scalable ultiprocessors What is a scalable design? (7.1) Realizing programming models (7.2) Scalable communication
More informationPrototyping a Configurable Cache/Scratchpad Memory with Virtualized User-Level RDMA Capability
Prototyping a Configurable Cache/Scratchpad Memory with Virtualized User-Level RDMA Capability George Kalokerinos, Vassilis Papaefstathiou, George Nikiforos, Stamatis Kavadias, Manolis Katevenis, Dionisios
More informationNear Memory Key/Value Lookup Acceleration MemSys 2017
Near Key/Value Lookup Acceleration MemSys 2017 October 3, 2017 Scott Lloyd, Maya Gokhale Center for Applied Scientific Computing This work was performed under the auspices of the U.S. Department of Energy
More informationHIGH-PERFORMANCE NETWORKING :: USER-LEVEL NETWORKING :: REMOTE DIRECT MEMORY ACCESS
HIGH-PERFORMANCE NETWORKING :: USER-LEVEL NETWORKING :: REMOTE DIRECT MEMORY ACCESS CS6410 Moontae Lee (Nov 20, 2014) Part 1 Overview 00 Background User-level Networking (U-Net) Remote Direct Memory Access
More informationRuntime OpenMP support using hardware primitives on explicitly memory managed multi-processor
UNIVERSITY OF LUGANO ALaRI INSTITUTE Master of Science in Embedded Systems Design Runtime OpenMP support using hardware primitives on explicitly memory managed multi-processor Master Thesis of: Pranav
More informationCSE Opera+ng System Principles
CSE 30341 Opera+ng System Principles Lecture 2 Introduc5on Con5nued Recap Last Lecture What is an opera+ng system & kernel? What is an interrupt? CSE 30341 Opera+ng System Principles 2 1 OS - Kernel CSE
More informationHETEROGENEOUS MEMORY MANAGEMENT. Linux Plumbers Conference Jérôme Glisse
HETEROGENEOUS MEMORY MANAGEMENT Linux Plumbers Conference 2018 Jérôme Glisse EVERYTHING IS A POINTER All data structures rely on pointers, explicitly or implicitly: Explicit in languages like C, C++,...
More informationCISC 879 Software Support for Multicore Architectures Spring Student Presentation 6: April 8. Presenter: Pujan Kafle, Deephan Mohan
CISC 879 Software Support for Multicore Architectures Spring 2008 Student Presentation 6: April 8 Presenter: Pujan Kafle, Deephan Mohan Scribe: Kanik Sem The following two papers were presented: A Synchronous
More informationM 3 Microkernel-based System for Heterogeneous Manycores
M 3 Microkernel-based System for Heterogeneous Manycores Nils Asmussen MKC, 06/29/2017 1 / 35 Outline 1 Introduction 2 Data Transfer Unit 3 Prototype Platforms 4 M 3 5 Summary 2 / 35 Outline 1 Introduction
More informationLinux Device Drivers: Case Study of a Storage Controller. Manolis Marazakis FORTH-ICS (CARV)
Linux Device Drivers: Case Study of a Storage Controller Manolis Marazakis FORTH-ICS (CARV) IOP348-based I/O Controller Programmable I/O Controller Continuous Data Protection: Versioning (snapshots), Migration,
More informationComputer-System Organization (cont.)
Computer-System Organization (cont.) Interrupt time line for a single process doing output. Interrupts are an important part of a computer architecture. Each computer design has its own interrupt mechanism,
More informationCS350: Final Exam Review
University of Waterloo CS350: Final Exam Review Gwynneth Leece, Andrew Song, Rebecca Putinski Winter, 2010 Intro, Threads & Concurrency What are the three views of an operating system? Describe them. Define
More informationPacket Switch Architecture
Packet Switch Architecture 3. Output Queueing Architectures 4. Input Queueing Architectures 5. Switching Fabrics 6. Flow and Congestion Control in Sw. Fabrics 7. Output Scheduling for QoS Guarantees 8.
More informationPacket Switch Architecture
Packet Switch Architecture 3. Output Queueing Architectures 4. Input Queueing Architectures 5. Switching Fabrics 6. Flow and Congestion Control in Sw. Fabrics 7. Output Scheduling for QoS Guarantees 8.
More information«Real Time Embedded systems» Multi Masters Systems
«Real Time Embedded systems» Multi Masters Systems rene.beuchat@epfl.ch LAP/ISIM/IC/EPFL Chargé de cours rene.beuchat@hesge.ch LSN/hepia Prof. HES 1 Multi Master on Chip On a System On Chip, Master can
More informationCSE398: Network Systems Design
CSE398: Network Systems Design Instructor: Dr. Liang Cheng Department of Computer Science and Engineering P.C. Rossin College of Engineering & Applied Science Lehigh University February 23, 2005 Outline
More informationHakim Weatherspoon CS 3410 Computer Science Cornell University
Hakim Weatherspoon CS 3410 Computer Science Cornell University The slides are the product of many rounds of teaching CS 3410 by Deniz Altinbuken, Professors Weatherspoon, Bala, Bracy, and Sirer. C practice
More informationInput/Output. Today. Next. Principles of I/O hardware & software I/O software layers Disks. Protection & Security
Input/Output Today Principles of I/O hardware & software I/O software layers Disks Next Protection & Security Operating Systems and I/O Two key operating system goals Control I/O devices Provide a simple,
More informationUnderstanding MPI on Cray XC30
Understanding MPI on Cray XC30 MPICH3 and Cray MPT Cray MPI uses MPICH3 distribution from Argonne Provides a good, robust and feature rich MPI Cray provides enhancements on top of this: low level communication
More informationHSA QUEUEING HOT CHIPS TUTORIAL - AUGUST 2013 IAN BRATT PRINCIPAL ENGINEER ARM
HSA QUEUEING HOT CHIPS TUTORIAL - AUGUST 2013 IAN BRATT PRINCIPAL ENGINEER ARM HSA QUEUEING, MOTIVATION MOTIVATION (TODAY S PICTURE) Application OS GPU Transfer buffer to GPU Copy/Map Memory Queue Job
More informationMain Points. Address Transla+on Concept. Flexible Address Transla+on. Efficient Address Transla+on
Address Transla+on Main Points Address Transla+on Concept How do we convert a virtual address to a physical address? Flexible Address Transla+on Segmenta+on Paging Mul+level transla+on Efficient Address
More informationCS533 Concepts of Operating Systems. Jonathan Walpole
CS533 Concepts of Operating Systems Jonathan Walpole Improving IPC by Kernel Design & The Performance of Micro- Kernel Based Systems The IPC Dilemma IPC is very import in µ-kernel design - Increases modularity,
More informationUNIX Sockets. COS 461 Precept 1
UNIX Sockets COS 461 Precept 1 Socket and Process Communica;on application layer User Process Socket transport layer (TCP/UDP) OS network stack network layer (IP) link layer (e.g. ethernet) Internet Internet
More informationChapter 4: network layer. Network service model. Two key network-layer functions. Network layer. Input port functions. Router architecture overview
Chapter 4: chapter goals: understand principles behind services service models forwarding versus routing how a router works generalized forwarding instantiation, implementation in the Internet 4- Network
More informationMultiprocessors & Thread Level Parallelism
Multiprocessors & Thread Level Parallelism COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline Introduction
More informationAgenda. About us Why para-virtualize RDMA Project overview Open issues Future plans
Agenda About us Why para-virtualize RDMA Project overview Open issues Future plans About us Marcel from KVM team in Redhat Yuval from Networking/RDMA team in Oracle This is a shared-effort open source
More informationLecture 14: Cache Innovations and DRAM. Today: cache access basics and innovations, DRAM (Sections )
Lecture 14: Cache Innovations and DRAM Today: cache access basics and innovations, DRAM (Sections 5.1-5.3) 1 Reducing Miss Rate Large block size reduces compulsory misses, reduces miss penalty in case
More information1/5/2012. Overview of Interconnects. Presentation Outline. Myrinet and Quadrics. Interconnects. Switch-Based Interconnects
Overview of Interconnects Myrinet and Quadrics Leading Modern Interconnects Presentation Outline General Concepts of Interconnects Myrinet Latest Products Quadrics Latest Release Our Research Interconnects
More informationSDSoC: Session 1
SDSoC: Session 1 ADAM@ADIUVOENGINEERING.COM What is SDSoC SDSoC is a system optimising compiler which allows us to optimise Zynq PS / PL Zynq MPSoC PS / PL MicroBlaze What does this mean? Following the
More informationParallel Computing Trends: from MPPs to NoWs
Parallel Computing Trends: from MPPs to NoWs (from Massively Parallel Processors to Networks of Workstations) Fall Research Forum Oct 18th, 1994 Thorsten von Eicken Department of Computer Science Cornell
More informationECE 1749H: Interconnec1on Networks for Parallel Computer Architectures: Interface with System Architecture. Prof. Natalie Enright Jerger
ECE 1749H: Interconnec1on Networks for Parallel Computer Architectures: Interface with System Architecture Prof. Natalie Enright Jerger Systems and Interfaces Look at how systems interact and interface
More informationFaRM: Fast Remote Memory
FaRM: Fast Remote Memory Problem Context DRAM prices have decreased significantly Cost effective to build commodity servers w/hundreds of GBs E.g. - cluster with 100 machines can hold tens of TBs of main
More informationSpring 2017 :: CSE 506. Device Programming. Nima Honarmand
Device Programming Nima Honarmand read/write interrupt read/write Spring 2017 :: CSE 506 Device Interface (Logical View) Device Interface Components: Device registers Device Memory DMA buffers Interrupt
More informationModule 14: "Directory-based Cache Coherence" Lecture 31: "Managing Directory Overhead" Directory-based Cache Coherence: Replacement of S blocks
Directory-based Cache Coherence: Replacement of S blocks Serialization VN deadlock Starvation Overflow schemes Sparse directory Remote access cache COMA Latency tolerance Page migration Queue lock in hardware
More informationScalable Distributed Memory Machines
Scalable Distributed Memory Machines Goal: Parallel machines that can be scaled to hundreds or thousands of processors. Design Choices: Custom-designed or commodity nodes? Network scalability. Capability
More informationOutline. Limited Scaling of a Bus
Outline Scalability physical, bandwidth, latency and cost level of integration Realizing rogramming Models network transactions protocols safety input buffer problem: N-1 fetch deadlock Communication Architecture
More informationAnne Bracy CS 3410 Computer Science Cornell University
Anne Bracy CS 3410 Computer Science Cornell University The slides were originally created by Deniz ALTINBUKEN. P&H Chapter 4.9, pages 445 452, appendix A.7 Manages all of the software and hardware on the
More informationUsing the Cray Gemini Performance Counters
Photos placed in horizontal position with even amount of white space between photos and header Using the Cray Gemini Performance Counters 0 1 2 3 4 5 6 7 Backplane Backplane 8 9 10 11 12 13 14 15 Backplane
More informationNetworks and Opera/ng Systems Chapter 19: I/O Subsystems
Networks and Opera/ng Systems Chapter 19: I/O Subsystems (252 0062 00) Donald Kossmann & Torsten Hoefler Frühjahrssemester 2013 Systems Group Department of Computer Science ETH Zürich Last /me: File System
More informationEvaluation and Improvements of Programming Models for the Intel SCC Many-core Processor
Evaluation and Improvements of Programming Models for the Intel SCC Many-core Processor Carsten Clauss, Stefan Lankes, Pablo Reble, Thomas Bemmerl International Workshop on New Algorithms and Programming
More informationPractical Near-Data Processing for In-Memory Analytics Frameworks
Practical Near-Data Processing for In-Memory Analytics Frameworks Mingyu Gao, Grant Ayers, Christos Kozyrakis Stanford University http://mast.stanford.edu PACT Oct 19, 2015 Motivating Trends End of Dennard
More informationCSE 431 Computer Architecture Fall Chapter 5A: Exploiting the Memory Hierarchy, Part 1
CSE 431 Computer Architecture Fall 2008 Chapter 5A: Exploiting the Memory Hierarchy, Part 1 Mary Jane Irwin ( www.cse.psu.edu/~mji ) [Adapted from Computer Organization and Design, 4 th Edition, Patterson
More informationProAc&ve Rou&ng In Scalable Data Centers with PARIS
ProAc&ve Rou&ng In Scalable Data Centers with PARIS Theophilus Benson Duke University Joint work with Dushyant Arora + and Jennifer Rexford* + Arista Networks *Princeton University Data Center Networks
More informationFORTH-ICS / TR-392 July Network Interface Architecture and Prototyping for Chip and Cluster Multiprocessors. Abstract
FORTH-ICS / TR-392 July 2007 Network Interface Architecture and Prototyping for Chip and Cluster Multiprocessors Michael Papamichael Abstract As computing and embedded systems evolve towards highly parallel
More informationLecture 8: Memory Management
Lecture 8: Memory Management CSE 120: Principles of Opera>ng Systems UC San Diego: Summer Session I, 2009 Frank Uyeda Announcements PeerWise ques>ons due tomorrow. Project 2 is due on Friday. Milestone
More informationInterconnect Routing
Interconnect Routing store-and-forward routing switch buffers entire message before passing it on latency = [(message length / bandwidth) + fixed overhead] * # hops wormhole routing pipeline message through
More informationOpera&ng Systems ECE344
Opera&ng Systems ECE344 Lecture 8: Paging Ding Yuan Lecture Overview Today we ll cover more paging mechanisms: Op&miza&ons Managing page tables (space) Efficient transla&ons (TLBs) (&me) Demand paged virtual
More informationThe influence of system calls and interrupts on the performances of a PC cluster using a Remote DMA communication primitive
The influence of system calls and interrupts on the performances of a PC cluster using a Remote DMA communication primitive Olivier Glück Jean-Luc Lamotte Alain Greiner Univ. Paris 6, France http://mpc.lip6.fr
More informationMemory Management. To improve CPU utilization in a multiprogramming environment we need multiple programs in main memory at the same time.
Memory Management To improve CPU utilization in a multiprogramming environment we need multiple programs in main memory at the same time. Basic CPUs and Physical Memory CPU cache Physical memory
More informationPipelining. CSC Friday, November 6, 2015
Pipelining CSC 211.01 Friday, November 6, 2015 Performance Issues Longest delay determines clock period Critical path: load instruction Instruction memory register file ALU data memory register file Not
More informationThe Multikernel: A new OS architecture for scalable multicore systems Baumann et al. Presentation: Mark Smith
The Multikernel: A new OS architecture for scalable multicore systems Baumann et al. Presentation: Mark Smith Review Introduction Optimizing the OS based on hardware Processor changes Shared Memory vs
More informationNetwork Processors. Nevin Heintze Agere Systems
Network Processors Nevin Heintze Agere Systems Network Processors What are the packaging challenges for NPs? Caveat: I know very little about packaging. Network Processors What are the packaging challenges
More informationOperating system Dr. Shroouq J.
2.2.2 DMA Structure In a simple terminal-input driver, when a line is to be read from the terminal, the first character typed is sent to the computer. When that character is received, the asynchronous-communication
More informationVirtual Memory B: Objec5ves
Virtual Memory B: Objec5ves Benefits of a virtual memory system" Demand paging, page-replacement algorithms, and allocation of page frames" The working-set model" Relationship between shared memory and
More informationDISTRIBUTED COMPUTER SYSTEMS
DISTRIBUTED COMPUTER SYSTEMS Communication Fundamental REMOTE PROCEDURE CALL Dr. Jack Lange Computer Science Department University of Pittsburgh Fall 2015 Outline Communication Architecture Fundamentals
More informationDEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING UNIT-1
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING Year & Semester Section Subject Code Subject Name Degree & Branch : I & II : M.E : CP7204 : Advanced Operating Systems : M.E C.S.E. 1. Define Process? UNIT-1
More information2. Link and Memory Architectures and Technologies
2. Link and Memory Architectures and Technologies 2.1 Links, Thruput/Buffering, Multi-Access Ovrhds 2.2 Memories: On-chip / Off-chip SRAM, DRAM 2.A Appendix: Elastic Buffers for Cross-Clock Commun. Manolis
More informationExercises: April 11. Hermann Härtig, TU Dresden, Distributed OS, Load Balancing
Exercises: April 11 1 PARTITIONING IN MPI COMMUNICATION AND NOISE AS HPC BOTTLENECK LOAD BALANCING DISTRIBUTED OPERATING SYSTEMS, SCALABILITY, SS 2017 Hermann Härtig THIS LECTURE Partitioning: bulk synchronous
More informationLink Layer. w/ credit to Rick Graziani (Cabrillo) for some of the anima<ons
Link Layer w/ credit to Rick Graziani (Cabrillo) for some of the anima
More informationFPGA implementation of a Cache Controller with Configurable Scratchpad Space
FORTH-ICS/TR-402 January 2010 FPGA implementation of a Cache Controller with Configurable Scratchpad Space Giorgos Nikiforos Abstract Chip Multiprocessors (CMP) are the dominant architectural approach
More informationMaking Performance Understandable: Towards a Standard for Performance Counters on Manycore Architectures
Parallel Hardware Parallel Applications IT industry (Silicon Valley) Parallel Software Users Making Performance Understandable: Towards a Standard for Performance Counters on Manycore Architectures Sarah
More informationCSE Traditional Operating Systems deal with typical system software designed to be:
CSE 6431 Traditional Operating Systems deal with typical system software designed to be: general purpose running on single processor machines Advanced Operating Systems are designed for either a special
More informationKeystone Architecture Inter-core Data Exchange
Application Report Lit. Number November 2011 Keystone Architecture Inter-core Data Exchange Brighton Feng Vincent Han Communication Infrastructure ABSTRACT This application note introduces various methods
More informationChapter 5 Memory Hierarchy Design. In-Cheol Park Dept. of EE, KAIST
Chapter 5 Memory Hierarchy Design In-Cheol Park Dept. of EE, KAIST Why cache? Microprocessor performance increment: 55% per year Memory performance increment: 7% per year Principles of locality Spatial
More informationMultifunction Networking Adapters
Ethernet s Extreme Makeover: Multifunction Networking Adapters Chuck Hudson Manager, ProLiant Networking Technology Hewlett-Packard 2004 Hewlett-Packard Development Company, L.P. The information contained
More informationNetFPGA Hardware Architecture
NetFPGA Hardware Architecture Jeffrey Shafer Some slides adapted from Stanford NetFPGA tutorials NetFPGA http://netfpga.org 2 NetFPGA Components Virtex-II Pro 5 FPGA 53,136 logic cells 4,176 Kbit block
More informationOperating Systems Comprehensive Exam. Spring Student ID # 3/16/2006
Operating Systems Comprehensive Exam Spring 2006 Student ID # 3/16/2006 You must complete all of part I (60%) You must complete two of the three sections in part II (20% each) In Part I, circle or select
More informationLarge Scale Multiprocessors and Scientific Applications. By Pushkar Ratnalikar Namrata Lele
Large Scale Multiprocessors and Scientific Applications By Pushkar Ratnalikar Namrata Lele Agenda Introduction Interprocessor Communication Characteristics of Scientific Applications Synchronization: Scaling
More informationLecture 13: Interconnection Networks. Topics: lots of background, recent innovations for power and performance
Lecture 13: Interconnection Networks Topics: lots of background, recent innovations for power and performance 1 Interconnection Networks Recall: fully connected network, arrays/rings, meshes/tori, trees,
More informationPredictable Interrupt Management and Scheduling in the Composite Component-based System
Predictable Interrupt Management and Scheduling in the Composite Component-based System Gabriel Parmer and Richard West Computer Science Department Boston University Boston, MA 02215 {gabep1, richwest}@cs.bu.edu
More informationOperating Systems Comprehensive Exam. Spring Student ID # 3/20/2013
Operating Systems Comprehensive Exam Spring 2013 Student ID # 3/20/2013 You must complete all of Section I You must complete two of the problems in Section II If you need more space to answer a question,
More informationMemory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)
Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2012/13 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2012/13 1 2
More informationI/O Device Controllers. I/O Systems. I/O Ports & Memory-Mapped I/O. Direct Memory Access (DMA) Operating Systems 10/20/2010. CSC 256/456 Fall
I/O Device Controllers I/O Systems CS 256/456 Dept. of Computer Science, University of Rochester 10/20/2010 CSC 2/456 1 I/O devices have both mechanical component & electronic component The electronic
More informationPersistent Memory Over Fabrics. Paul Grun, Cray Inc Stephen Bates, Eideticom Rob Davis, Mellanox Technologies
Persistent Memory Over Fabrics Paul Grun, Cray Inc Stephen Bates, Eideticom Rob Davis, Mellanox Technologies Agenda Persistent Memory as viewed by a consumer, and some guidance to the fabric community
More informationµ-kernel Construction (12)
µ-kernel Construction (12) Review 1 Threading Thread state must be saved/restored on thread switch We need a Thread Control Block (TCB) per thread TCBs must be kernel objects TCBs implement threads We
More informationAdvanced Computer Networks. End Host Optimization
Oriana Riva, Department of Computer Science ETH Zürich 263 3501 00 End Host Optimization Patrick Stuedi Spring Semester 2017 1 Today End-host optimizations: NUMA-aware networking Kernel-bypass Remote Direct
More informationMoneta: A High-performance Storage Array Architecture for Nextgeneration, Micro 2010
Moneta: A High-performance Storage Array Architecture for Nextgeneration, Non-volatile Memories Micro 2010 NVM-based SSD NVMs are replacing spinning-disks Performance of disks has lagged NAND flash showed
More informationDesigning Next Generation Data-Centers with Advanced Communication Protocols and Systems Services
Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services P. Balaji, K. Vaidyanathan, S. Narravula, H. W. Jin and D. K. Panda Network Based Computing Laboratory
More informationUnder the Hood, Part 1: Implementing Message Passing
Lecture 27: Under the Hood, Part 1: Implementing Message Passing Parallel Computer Architecture and Programming CMU 15-418/15-618, Fall 2017 Today s Theme 2 Message passing model (abstraction) Threads
More informationLecture 2: Snooping and Directory Protocols. Topics: Snooping wrap-up and directory implementations
Lecture 2: Snooping and Directory Protocols Topics: Snooping wrap-up and directory implementations 1 Split Transaction Bus So far, we have assumed that a coherence operation (request, snoops, responses,
More informationLOAD BALANCING DISTRIBUTED OPERATING SYSTEMS, SCALABILITY, SS Hermann Härtig
LOAD BALANCING DISTRIBUTED OPERATING SYSTEMS, SCALABILITY, SS 2016 Hermann Härtig LECTURE OBJECTIVES starting points independent Unix processes and block synchronous execution which component (point in
More information