Solros: A Data-Centric Operating System Architecture for Heterogeneous Computing
|
|
- Claude Fowler
- 5 years ago
- Views:
Transcription
1 Solros: A Data-Centric Operating System Architecture for Heterogeneous Computing Changwoo Min, Woonhak Kang, Mohan Kumar, Sanidhya Kashyap, Steffen Maass, Heeseung Jo, Taesoo Kim Virginia Tech, ebay, Georgia Tech, Chonbuk National University April 26, 2018 Changwoo Min Solros: Data-Centric OS April 26, / 21
2 Cambrian Explosion of Processor Architecture Specialization of general-purpose processors Changwoo Min Solros: Data-Centric OS April 26, / 21
3 Cambrian Explosion of Processor Architecture Specialization of general-purpose processors Generalization of co-processors Changwoo Min Solros: Data-Centric OS April 26, / 21
4 Cambrian Explosion of Processor Architecture Specialization of general-purpose processors Generalization of co-processors Specialization of co-processors Changwoo Min Solros: Data-Centric OS April 26, / 21
5 Blazingly fast IO Devices Blazingly fast storage/memory Changwoo Min Solros: Data-Centric OS April 26, / 21
6 Blazingly fast IO Devices Blazingly fast storage/memory Blazingly fast network Changwoo Min Solros: Data-Centric OS April 26, / 21
7 Blazingly fast IO Devices How to exploit the full potential of such hardware devices without pain? System-wide performance Ease of programming Blazingly fast storage/memory Blazingly fast network Changwoo Min Solros: Data-Centric OS April 26, / 21
8 Outline 1 Heterogeneous Computing Architectures 2 Solros: Split-Kernel Approach Solros Architecture Operating System Services 3 Evaluation Changwoo Min Solros: Data-Centric OS April 26, / 21
9 Host-Centric Approach Host OS controls co-processors and IO devices Examples: OpenCL, CUDA Host processor OS I/O device SSD / NIC Co-processor control data Changwoo Min Solros: Data-Centric OS April 26, / 21
10 Host-Centric Approach Host OS controls co-processors and IO devices Examples: OpenCL, CUDA Host processor 1 OS I/O device SSD / NIC Co-processor control data Changwoo Min Solros: Data-Centric OS April 26, / 21
11 Host-Centric Approach Host OS controls co-processors and IO devices Examples: OpenCL, CUDA Host processor 1 OS 2 I/O device SSD / NIC Co-processor control data Changwoo Min Solros: Data-Centric OS April 26, / 21
12 Host-Centric Approach Host OS controls co-processors and IO devices Examples: OpenCL, CUDA Host processor 1 OS 2 I/O device SSD / NIC 3 Co-processor control data Changwoo Min Solros: Data-Centric OS April 26, / 21
13 Host-Centric Approach Host OS controls co-processors and IO devices Examples: OpenCL, CUDA Host processor 1 OS 2 I/O device SSD / NIC 3 Co-processor control data Problem Redundant data communication Complex to program and hard to optimize Changwoo Min Solros: Data-Centric OS April 26, / 21
14 Coprocessor-Centric Architecture Co-processors control IO devices Examples: Xeon Phi (Linux), GPUfs [ASPLOS13], GPUNet [OSDI14] Host processor OS I/O device SSD / NIC Co-processor OS control data Changwoo Min Solros: Data-Centric OS April 26, / 21
15 Coprocessor-Centric Architecture Co-processors control IO devices Examples: Xeon Phi (Linux), GPUfs [ASPLOS13], GPUNet [OSDI14] Host processor OS I/O device SSD / NIC Co-processor 1 OS control data Changwoo Min Solros: Data-Centric OS April 26, / 21
16 Coprocessor-Centric Architecture Co-processors control IO devices Examples: Xeon Phi (Linux), GPUfs [ASPLOS13], GPUNet [OSDI14] Host processor OS I/O device SSD / NIC 2 Co-processor 1 OS control data Changwoo Min Solros: Data-Centric OS April 26, / 21
17 Coprocessor-Centric Architecture Co-processors control IO devices Examples: Xeon Phi (Linux), GPUfs [ASPLOS13], GPUNet [OSDI14] Host processor OS I/O device SSD / NIC 2 Co-processor 1 OS control data Problem Significant effort required for porting IO stack to co-processor Not completely exploiting powerful host processors Changwoo Min Solros: Data-Centric OS April 26, / 21
18 Outline 1 Heterogeneous Computing Architectures 2 Solros: Split-Kernel Approach Solros Architecture Operating System Services 3 Evaluation Changwoo Min Solros: Data-Centric OS April 26, / 21
19 Solros Goal Ease of programming Best use of processor architecture System-wide optimization Changwoo Min Solros: Data-Centric OS April 26, / 21
20 Solros Goal Ease of programming Best use of processor architecture System-wide optimization Challenge Co-processor needs IO abstraction IO stacks is branch-divergent and difficult to parallelize It needs system-wide information Changwoo Min Solros: Data-Centric OS April 26, / 21
21 Solros Architecture Split-Kernel Architecture Data-plane OS Runs on a co-processor Provides IO abstraction Delegates actual IO operations to a control-plane OS Control-plane OS Runs on a host processor Runs actual IO stack Performs system-wide coordination Changwoo Min Solros: Data-Centric OS April 26, / 21
22 Solros Architecture Control-plane OS: actual OS service + system-wide coordination Data-plane OS: thin communication layer to host processor Co-processor OS stub Host processor OS proxy Policy I/O device SSD / NIC control data Changwoo Min Solros: Data-Centric OS April 26, / 21
23 Solros Architecture Control-plane OS: actual OS service + system-wide coordination Data-plane OS: thin communication layer to host processor Co-processor 1 OS stub Host processor OS proxy Policy I/O device SSD / NIC control data Changwoo Min Solros: Data-Centric OS April 26, / 21
24 Solros Architecture Control-plane OS: actual OS service + system-wide coordination Data-plane OS: thin communication layer to host processor Co-processor 1 OS stub I/O device SSD / NIC Host processor 2 OS proxy Policy control data Changwoo Min Solros: Data-Centric OS April 26, / 21
25 Solros Architecture Control-plane OS: actual OS service + system-wide coordination Data-plane OS: thin communication layer to host processor Co-processor 1 OS stub I/O device SSD / NIC 3 Host processor 2 OS proxy Policy control data Changwoo Min Solros: Data-Centric OS April 26, / 21
26 Solros Architecture Control-plane OS: actual OS service + system-wide coordination Data-plane OS: thin communication layer to host processor Co-processor 1 OS stub I/O device SSD / NIC 3 Host processor 2 OS proxy Policy control data Co-processor has OS abstraction with minimal effort Best use of each of the fat and lean processors Efficient global coordination among devices (policy) Changwoo Min Solros: Data-Centric OS April 26, / 21
27 Operating System Services 1 Transport service 2 Filesystem service 3 Network service Changwoo Min Solros: Data-Centric OS April 26, / 21
28 Operating System Services 1 Transport service 2 Filesystem service 3 Network service Changwoo Min Solros: Data-Centric OS April 26, / 21
29 Transport Service High performance data transfer among devices are challenging: Uniform data transfer among devices High contention in massively-parallel co-processor Asymmetric performance between host processor and co-processor Changwoo Min Solros: Data-Centric OS April 26, / 21
30 Transport Service High performance data transfer among devices are challenging: Uniform data transfer among devices High contention in massively-parallel co-processor Asymmetric performance between host processor and co-processor Our approach Uniform data transfer system-mapped PCIe window High contention combining, replication, interleaving, etc. Asymmetric performance flexibly configurable (host DMA engine vs. co-processor DMA engine) Changwoo Min Solros: Data-Centric OS April 26, / 21
31 Transport Service High performance data transfer among devices are challenging: Uniform data transfer among devices High contention in massively-parallel co-processor Asymmetric performance between host processor and co-processor Our approach Uniform data transfer system-mapped PCIe window High contention combining, replication, interleaving, etc. Asymmetric performance flexibly configurable (host DMA engine vs. co-processor DMA engine) See details in the paper Changwoo Min Solros: Data-Centric OS April 26, / 21
32 Filesystem Service Peer-to-peer operation Buffered operation Co-processor 1 File system stub Host processor File system proxy File system PCIe DMA engine SSD control data Zero-copy of data between co-processor memory and SSD Minimal data transfer Changwoo Min Solros: Data-Centric OS April 26, / 21
33 Filesystem Service Peer-to-peer operation Buffered operation Co-processor 1 File system stub 2 Host processor File system proxy File system PCIe DMA engine SSD control data Zero-copy of data between co-processor memory and SSD Minimal data transfer Changwoo Min Solros: Data-Centric OS April 26, / 21
34 Filesystem Service Peer-to-peer operation Buffered operation Co-processor 1 File system stub 2 Host processor File system proxy 3 File system PCIe DMA engine SSD control data Zero-copy of data between co-processor memory and SSD Minimal data transfer Changwoo Min Solros: Data-Centric OS April 26, / 21
35 Filesystem Service Peer-to-peer operation Buffered operation Co-processor 1 File system stub 2 Host processor File system proxy 3 4 File system PCIe DMA engine SSD control data Zero-copy of data between co-processor memory and SSD Minimal data transfer Changwoo Min Solros: Data-Centric OS April 26, / 21
36 Filesystem Service Peer-to-peer operation Buffered operation Co-processor 1 File system stub 2 5 DMA engine SSD Host processor File system proxy 3 4 File system PCIe control data Zero-copy of data between co-processor memory and SSD Minimal data transfer Changwoo Min Solros: Data-Centric OS April 26, / 21
37 Filesystem Service Peer-to-peer operation Buffered operation Co-processor 1 File system stub 2 Host processor File system proxy Buffer cache File system PCIe control data DMA engine SSD Reduce storage IO by leveraging shared buffer cache among co-processors Avoid performance anomaly of peer-to-peer communication over PCIe bus Changwoo Min Solros: Data-Centric OS April 26, / 21
38 Filesystem Service Peer-to-peer operation Buffered operation Co-processor 1 File system stub 2 Host processor File system proxy Buffer cache 3 File system PCIe control data DMA engine SSD Reduce storage IO by leveraging shared buffer cache among co-processors Avoid performance anomaly of peer-to-peer communication over PCIe bus Changwoo Min Solros: Data-Centric OS April 26, / 21
39 Filesystem Service Peer-to-peer operation Buffered operation Co-processor 1 File system stub 2 Host processor File system proxy Buffer cache 3 File system 4 PCIe control data DMA engine SSD Reduce storage IO by leveraging shared buffer cache among co-processors Avoid performance anomaly of peer-to-peer communication over PCIe bus Changwoo Min Solros: Data-Centric OS April 26, / 21
40 Filesystem Service Peer-to-peer operation Buffered operation Co-processor 1 File system stub 2 control data DMA engine SSD Host processor File system proxy Buffer cache 3 File system 4 PCIe 5 Reduce storage IO by leveraging shared buffer cache among co-processors Avoid performance anomaly of peer-to-peer communication over PCIe bus Changwoo Min Solros: Data-Centric OS April 26, / 21
41 Filesystem Service Peer-to-peer operation Buffered operation Co-processor 1 File system stub 6 2 control data DMA engine SSD Host processor File system proxy Buffer cache 3 File system 4 PCIe 5 Reduce storage IO by leveraging shared buffer cache among co-processors Avoid performance anomaly of peer-to-peer communication over PCIe bus Changwoo Min Solros: Data-Centric OS April 26, / 21
42 Implementation Host: 2-socket Xeon processor (12 cores each) Co-processor: 4 Xeon Phi (KNC, 61 cores, Linux, PCIe Gen 3x16) Storage device: 4 NVMe SSD NIC: 100 Gbps Ethernet Module Added lines Lines of code Deleted lines Transport service 1, File system Service Network Service Stub 5,957 2,073 Proxy 2, Stub 2, Proxy 5, NVMe device driver SCIF kernel module Total 18,844 2,714 Changwoo Min Solros: Data-Centric OS April 26, / 21
43 Evaluation Questions: Performance of Solros services Impact on real-world applications Changwoo Min Solros: Data-Centric OS April 26, / 21
44 Performance of Solros Services GB/sec (a) file random read on SSD 2MB 1MB 512KB 256KB 128KB 64KB 32KB Phi-Solros Phi-Linux Percentage of requests (%) MB 0 (b) TCP latency: 64-byte message Phi-Solros Phi-Linux Latency(usec) File IO performance: 19x faster than the stock Linux on Xeon Phi TCP latency (99 percentile): 7x shorter than the stock Linux on Xeon Phi Changwoo Min Solros: Data-Centric OS April 26, / 21
45 Performance of Solros Services 6 (a) file random read on SSD File system Block/Transport Storage 10 8 (b) TCP latency: 64-byte message Network stack Proxy/Transport Latency(msec) Phi-Linux Phi-Solros 0 Phi-Linux Phi-Solros Significant performance gain in data transport Running IO stack on co-processor is slower Changwoo Min Solros: Data-Centric OS April 26, / 21
46 Real-world - Image Search Image search engine is running on Xeon Phi Image database is on NVMe SSD (shared read-only) Image search queries are from network Elapsed Time (sec) Phi-Solros Phi-Linux Bandwidth (MB/s) Phi-Solros Phi-Linux Time(sec) Solros performs 2x faster than stock Linux on Xeon Phi Changwoo Min Solros: Data-Centric OS April 26, / 21
47 Conclusion Solros, a new operating system architecture for co-processors and fast IO devices Control-plane, data-plane architecture allow: Supporting high-level OS abstraction on co-processor Efficient global coordination among devices Near ideal IO performance from co-processor We will release source code soon Changwoo Min Solros: Data-Centric OS April 26, / 21
48 Image Search - Scalability Increase the number of Xeon Phi to 4 4 Normalized speedup The number of Xeon Phi co-processors Solros load balancing mechanism achieves near linear scaling Changwoo Min Solros: Data-Centric OS April 26, / 21
49 Related work Control-plane/data-plane OS: Arrakis [OSDI 14], IX [OSDI 14] OS for heterogeneous systems: Helios [SOSP]09], M3 [ASPLOS 16], Hydra [ASPLOS 08] IO support for GPU: PTask [SOSP 11], GPUfs [ASPLOS 13], GPUnet [OSDI 14] Changwoo Min Solros: Data-Centric OS April 26, / 21
50 Transport service Host's virtual address space Host's physical address space Device's physical address sapce App 1 App 2 Host RAM PCIe Window mmio DRAM NIC mmio DRAM mapped to Xeon Phi mapped to (in-host) (across devices) mmio DRAM... mmio NVMe Changwoo Min Solros: Data-Centric OS April 26, / 21
51 Network Service (TCP) Outbound operation Inbound operation Co-processor TCP stub Host processor TCP proxy Load balancer 2 TCP stack 1 PCIe control data NIC A load balancer on a host distributes incoming TCP connections to one of least-loaded co-processors. See details in the paper. Changwoo Min Solros: Data-Centric OS April 26, / 21
52 Discussion Hardware support other than Xeon Phi Two atomic instructions: transport service MMU: isolation among co-processor applications Scalability of control-plane OS Limited by scalability of OS service, PCIe interconnect, and performance of IO devices Changwoo Min Solros: Data-Centric OS April 26, / 21
53 Real-world - Text Search CLucene text indexing engine running on Xeon Phi Text data is on NVMe SSD Elapsed Time (sec) Phi-Solros Phi-Linux Bandwidth (GB/s) Phi-Solros Phi-Linux (virtio) Time(sec) Solros performs 19x faster than stock Linux (ext4/virtio) on Xeon Phi Changwoo Min Solros: Data-Centric OS April 26, / 21
Mosaic: Processing a Trillion-Edge Graph on a Single Machine
Mosaic: Processing a Trillion-Edge Graph on a Single Machine Steffen Maass, Changwoo Min, Sanidhya Kashyap, Woonhak Kang, Mohan Kumar, Taesoo Kim Georgia Institute of Technology Best Student Paper @ EuroSys
More informationScaling Internet TV Content Delivery ALEX GUTARIN DIRECTOR OF ENGINEERING, NETFLIX
Scaling Internet TV Content Delivery ALEX GUTARIN DIRECTOR OF ENGINEERING, NETFLIX Inventing Internet TV Available in more than 190 countries 104+ million subscribers Lots of Streaming == Lots of Traffic
More informationRDMA and Hardware Support
RDMA and Hardware Support SIGCOMM Topic Preview 2018 Yibo Zhu Microsoft Research 1 The (Traditional) Journey of Data How app developers see the network Under the hood This architecture had been working
More informationA Disseminated Distributed OS for Hardware Resource Disaggregation Yizhou Shan
LegoOS A Disseminated Distributed OS for Hardware Resource Disaggregation Yizhou Shan, Yutong Huang, Yilun Chen, and Yiying Zhang Y 4 1 2 Monolithic Server OS / Hypervisor 3 Problems? 4 cpu mem Resource
More informationAccelerator-centric operating systems
Accelerator-centric operating systems Rethinking the role of s in modern computers Mark Silberstein EE, Technion System design challenge: Programmability and Performance 2 System design challenge: Programmability
More informationMessaging Overview. Introduction. Gen-Z Messaging
Page 1 of 6 Messaging Overview Introduction Gen-Z is a new data access technology that not only enhances memory and data storage solutions, but also provides a framework for both optimized and traditional
More informationPASTE: A Network Programming Interface for Non-Volatile Main Memory
PASTE: A Network Programming Interface for Non-Volatile Main Memory Michio Honda (NEC Laboratories Europe) Giuseppe Lettieri (Università di Pisa) Lars Eggert and Douglas Santry (NetApp) USENIX NSDI 2018
More informationRe-architecting Virtualization in Heterogeneous Multicore Systems
Re-architecting Virtualization in Heterogeneous Multicore Systems Himanshu Raj, Sanjay Kumar, Vishakha Gupta, Gregory Diamos, Nawaf Alamoosa, Ada Gavrilovska, Karsten Schwan, Sudhakar Yalamanchili College
More informationFast packet processing in the cloud. Dániel Géhberger Ericsson Research
Fast packet processing in the cloud Dániel Géhberger Ericsson Research Outline Motivation Service chains Hardware related topics, acceleration Virtualization basics Software performance and acceleration
More informationPASTE: A Networking API for Non-Volatile Main Memory
PASTE: A Networking API for Non-Volatile Main Memory Michio Honda (NEC Laboratories Europe) Lars Eggert (NetApp) Douglas Santry (NetApp) TSVAREA@IETF 99, Prague May 22th 2017 More details at our HotNets
More informationImportant new NVMe features for optimizing the data pipeline
Important new NVMe features for optimizing the data pipeline Dr. Stephen Bates, CTO Eideticom Santa Clara, CA 1 Outline Intro to NVMe Controller Memory Buffers (CMBs) Use cases for CMBs Submission Queue
More informationFarewell to Servers: Hardware, Software, and Network Approaches towards Datacenter Resource Disaggregation
Farewell to Servers: Hardware, Software, and Network Approaches towards Datacenter Resource Disaggregation Yiying Zhang Datacenter 3 Monolithic Computer OS / Hypervisor 4 Can monolithic Application Hardware
More informationMaximizing heterogeneous system performance with ARM interconnect and CCIX
Maximizing heterogeneous system performance with ARM interconnect and CCIX Neil Parris, Director of product marketing Systems and software group, ARM Teratec June 2017 Intelligent flexible cloud to enable
More informationEfficient Memory Mapped File I/O for In-Memory File Systems. Jungsik Choi, Jiwon Kim, Hwansoo Han
Efficient Memory Mapped File I/O for In-Memory File Systems Jungsik Choi, Jiwon Kim, Hwansoo Han Operations Per Second Storage Latency Close to DRAM SATA/SAS Flash SSD (~00μs) PCIe Flash SSD (~60 μs) D-XPoint
More informationAccelerating Data Centers Using NVMe and CUDA
Accelerating Data Centers Using NVMe and CUDA Stephen Bates, PhD Technical Director, CSTO, PMC-Sierra Santa Clara, CA 1 Project Donard @ PMC-Sierra Donard is a PMC CTO project that leverages NVM Express
More informationOvercoming the Memory System Challenge in Dataflow Processing. Darren Jones, Wave Computing Drew Wingard, Sonics
Overcoming the Memory System Challenge in Dataflow Processing Darren Jones, Wave Computing Drew Wingard, Sonics Current Technology Limits Deep Learning Performance Deep Learning Dataflow Graph Existing
More informationBe Fast, Cheap and in Control with SwitchKV. Xiaozhou Li
Be Fast, Cheap and in Control with SwitchKV Xiaozhou Li Goal: fast and cost-efficient key-value store Store, retrieve, manage key-value objects Get(key)/Put(key,value)/Delete(key) Target: cluster-level
More informationFarewell to Servers: Resource Disaggregation
Farewell to Servers: Hardware, Software, and Network Approaches towards Datacenter Resource Disaggregation Yiying Zhang 2 Monolithic Computer OS / Hypervisor 3 Can monolithic Application Hardware servers
More informationIntel Xeon Phi Coprocessors
Intel Xeon Phi Coprocessors Reference: Parallel Programming and Optimization with Intel Xeon Phi Coprocessors, by A. Vladimirov and V. Karpusenko, 2013 Ring Bus on Intel Xeon Phi Example with 8 cores Xeon
More informationManaging Array of SSDs When the Storage Device is No Longer the Performance Bottleneck
Managing Array of Ds When the torage Device is No Longer the Performance Bottleneck Byung. Kim, Jaeho Kim, am H. Noh UNIT (Ulsan National Institute of cience & Technology) Outline Motivation & Observation
More informationMemory-Based Cloud Architectures
Memory-Based Cloud Architectures ( Or: Technical Challenges for OnDemand Business Software) Jan Schaffner Enterprise Platform and Integration Concepts Group Example: Enterprise Benchmarking -) *%'+,#$)
More informationMellanox Technologies Maximize Cluster Performance and Productivity. Gilad Shainer, October, 2007
Mellanox Technologies Maximize Cluster Performance and Productivity Gilad Shainer, shainer@mellanox.com October, 27 Mellanox Technologies Hardware OEMs Servers And Blades Applications End-Users Enterprise
More informationApplication Acceleration Beyond Flash Storage
Application Acceleration Beyond Flash Storage Session 303C Mellanox Technologies Flash Memory Summit July 2014 Accelerating Applications, Step-by-Step First Steps Make compute fast Moore s Law Make storage
More informationGeneric System Calls for GPUs
Generic System Calls for GPUs Ján Veselý*, Arkaprava Basu, Abhishek Bhattacharjee*, Gabriel H. Loh, Mark Oskin, Steven K. Reinhardt *Rutgers University, Indian Institute of Science, Advanced Micro Devices
More informationEXTENDING AN ASYNCHRONOUS MESSAGING LIBRARY USING AN RDMA-ENABLED INTERCONNECT. Konstantinos Alexopoulos ECE NTUA CSLab
EXTENDING AN ASYNCHRONOUS MESSAGING LIBRARY USING AN RDMA-ENABLED INTERCONNECT Konstantinos Alexopoulos ECE NTUA CSLab MOTIVATION HPC, Multi-node & Heterogeneous Systems Communication with low latency
More informationMoneta: A High-Performance Storage Architecture for Next-generation, Non-volatile Memories
Moneta: A High-Performance Storage Architecture for Next-generation, Non-volatile Memories Adrian M. Caulfield Arup De, Joel Coburn, Todor I. Mollov, Rajesh K. Gupta, Steven Swanson Non-Volatile Systems
More informationHow Might Recently Formed System Interconnect Consortia Affect PM? Doug Voigt, SNIA TC
How Might Recently Formed System Interconnect Consortia Affect PM? Doug Voigt, SNIA TC Three Consortia Formed in Oct 2016 Gen-Z Open CAPI CCIX complex to rack scale memory fabric Cache coherent accelerator
More informationI/O Systems. Jo, Heeseung
I/O Systems Jo, Heeseung Today's Topics Device characteristics Block device vs. Character device Direct I/O vs. Memory-mapped I/O Polling vs. Interrupts Programmed I/O vs. DMA Blocking vs. Non-blocking
More informationMartin Dubois, ing. Contents
Martin Dubois, ing Contents Without OpenNet vs With OpenNet Technical information Possible applications Artificial Intelligence Deep Packet Inspection Image and Video processing Network equipment development
More informationExploiting InfiniBand and GPUDirect Technology for High Performance Collectives on GPU Clusters
Exploiting InfiniBand and Direct Technology for High Performance Collectives on Clusters Ching-Hsiang Chu chu.368@osu.edu Department of Computer Science and Engineering The Ohio State University OSU Booth
More informationSoftFlash: Programmable Storage in Future Data Centers Jae Do Researcher, Microsoft Research
SoftFlash: Programmable Storage in Future Data Centers Jae Do Researcher, Microsoft Research 1 The world s most valuable resource Data is everywhere! May. 2017 Values from Data! Need infrastructures for
More informationUnderstanding Manycore Scalability of File Systems. Changwoo Min, Sanidhya Kashyap, Steffen Maass Woonhak Kang, and Taesoo Kim
Understanding Manycore Scalability of File Systems Changwoo Min, Sanidhya Kashyap, Steffen Maass Woonhak Kang, and Taesoo Kim Application must parallelize I/O operations Death of single core CPU scaling
More informationExtremely Fast Distributed Storage for Cloud Service Providers
Solution brief Intel Storage Builders StorPool Storage Intel SSD DC S3510 Series Intel Xeon Processor E3 and E5 Families Intel Ethernet Converged Network Adapter X710 Family Extremely Fast Distributed
More informationGPUfs: Integrating a file system with GPUs
ASPLOS 2013 GPUfs: Integrating a file system with GPUs Mark Silberstein (UT Austin/Technion) Bryan Ford (Yale), Idit Keidar (Technion) Emmett Witchel (UT Austin) 1 Traditional System Architecture Applications
More informationREMOTE PERSISTENT MEMORY ACCESS WORKLOAD SCENARIOS AND RDMA SEMANTICS
13th ANNUAL WORKSHOP 2017 REMOTE PERSISTENT MEMORY ACCESS WORKLOAD SCENARIOS AND RDMA SEMANTICS Tom Talpey Microsoft [ March 31, 2017 ] OUTLINE Windows Persistent Memory Support A brief summary, for better
More informationMapping MPI+X Applications to Multi-GPU Architectures
Mapping MPI+X Applications to Multi-GPU Architectures A Performance-Portable Approach Edgar A. León Computer Scientist San Jose, CA March 28, 2018 GPU Technology Conference This work was performed under
More informationFalcon: Scaling IO Performance in Multi-SSD Volumes. The George Washington University
Falcon: Scaling IO Performance in Multi-SSD Volumes Pradeep Kumar H Howie Huang The George Washington University SSDs in Big Data Applications Recent trends advocate using many SSDs for higher throughput
More informationNFS/RDMA over 40Gbps iwarp Wael Noureddine Chelsio Communications
NFS/RDMA over 40Gbps iwarp Wael Noureddine Chelsio Communications Outline RDMA Motivating trends iwarp NFS over RDMA Overview Chelsio T5 support Performance results 2 Adoption Rate of 40GbE Source: Crehan
More informationAn NVMe-based Offload Engine for Storage Acceleration Sean Gibb, Eideticom Stephen Bates, Raithlin
An NVMe-based Offload Engine for Storage Acceleration Sean Gibb, Eideticom Stephen Bates, Raithlin 1 Overview Acceleration for Storage NVMe for Acceleration How are we using (abusing ;-)) NVMe to support
More informationLearn Your Alphabet - SRIOV, NPIV, RoCE, iwarp to Pump Up Virtual Infrastructure Performance
Learn Your Alphabet - SRIOV, NPIV, RoCE, iwarp to Pump Up Virtual Infrastructure Performance TechTarget Dennis Martin 1 Agenda About Demartek I/O Virtualization Concepts RDMA Concepts Examples Demartek
More informationGPUfs: Integrating a file system with GPUs
GPUfs: Integrating a file system with GPUs Mark Silberstein (UT Austin/Technion) Bryan Ford (Yale), Idit Keidar (Technion) Emmett Witchel (UT Austin) 1 Traditional System Architecture Applications OS CPU
More informationDesigning Optimized MPI Broadcast and Allreduce for Many Integrated Core (MIC) InfiniBand Clusters
Designing Optimized MPI Broadcast and Allreduce for Many Integrated Core (MIC) InfiniBand Clusters K. Kandalla, A. Venkatesh, K. Hamidouche, S. Potluri, D. Bureddy and D. K. Panda Presented by Dr. Xiaoyi
More informationDPDK Summit China 2017
Summit China 2017 Embedded Network Architecture Optimization Based on Lin Hao T1 Networks Agenda Our History What is an embedded network device Challenge to us Requirements for device today Our solution
More informationIX: A Protected Dataplane Operating System for High Throughput and Low Latency
IX: A Protected Dataplane Operating System for High Throughput and Low Latency Adam Belay et al. Proc. of the 11th USENIX Symp. on OSDI, pp. 49-65, 2014. Presented by Han Zhang & Zaina Hamid Challenges
More informationEleos: Exit-Less OS Services for SGX Enclaves
Eleos: Exit-Less OS Services for SGX Enclaves Meni Orenbach Marina Minkin Pavel Lifshits Mark Silberstein Accelerated Computing Systems Lab Haifa, Israel What do we do? Improve performance: I/O intensive
More informationImpact of Cache Coherence Protocols on the Processing of Network Traffic
Impact of Cache Coherence Protocols on the Processing of Network Traffic Amit Kumar and Ram Huggahalli Communication Technology Lab Corporate Technology Group Intel Corporation 12/3/2007 Outline Background
More informationComputer Systems Laboratory Sungkyunkwan University
I/O System Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Introduction (1) I/O devices can be characterized by Behavior: input, output, storage
More informationNear- Data Computa.on: It s Not (Just) About Performance
Near- Data Computa.on: It s Not (Just) About Performance Steven Swanson Non- Vola0le Systems Laboratory Computer Science and Engineering University of California, San Diego 1 Solid State Memories NAND
More informationThis presentation provides an overview of Gen Z architecture and its application in multiple use cases.
This presentation provides an overview of Gen Z architecture and its application in multiple use cases. 1 2 Despite numerous advances in data storage and computation, data access complexity continues to
More informationIBM Power AC922 Server
IBM Power AC922 Server The Best Server for Enterprise AI Highlights More accuracy - GPUs access system RAM for larger models Faster insights - significant deep learning speedups Rapid deployment - integrated
More informationParallel Computing: Parallel Architectures Jin, Hai
Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer
More informationIsoStack Highly Efficient Network Processing on Dedicated Cores
IsoStack Highly Efficient Network Processing on Dedicated Cores Leah Shalev Eran Borovik, Julian Satran, Muli Ben-Yehuda Outline Motivation IsoStack architecture Prototype TCP/IP over 10GE on a single
More informationDesigning Next Generation FS for NVMe and NVMe-oF
Designing Next Generation FS for NVMe and NVMe-oF Liran Zvibel CTO, Co-founder Weka.IO @liranzvibel Santa Clara, CA 1 Designing Next Generation FS for NVMe and NVMe-oF Liran Zvibel CTO, Co-founder Weka.IO
More informationUnderstanding and Improving the Cost of Scaling Distributed Event Processing
Understanding and Improving the Cost of Scaling Distributed Event Processing Shoaib Akram, Manolis Marazakis, and Angelos Bilas shbakram@ics.forth.gr Foundation for Research and Technology Hellas (FORTH)
More informationIBM Power Advanced Compute (AC) AC922 Server
IBM Power Advanced Compute (AC) AC922 Server The Best Server for Enterprise AI Highlights IBM Power Systems Accelerated Compute (AC922) server is an acceleration superhighway to enterprise- class AI. A
More informationExtending the NVMHCI Standard to Enterprise
Extending the NVMHCI Standard to Enterprise Amber Huffman Principal Engineer Intel Corporation August 2009 1 Outline Remember: What is NVMHCI PCIe SSDs Coming with Challenges Enterprise Extensions to NVMHCI
More informationTo Infiniband or Not Infiniband, One Site s s Perspective. Steve Woods MCNC
To Infiniband or Not Infiniband, One Site s s Perspective Steve Woods MCNC 1 Agenda Infiniband background Current configuration Base Performance Application performance experience Future Conclusions 2
More informationIntel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins
Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins Outline History & Motivation Architecture Core architecture Network Topology Memory hierarchy Brief comparison to GPU & Tilera Programming Applications
More informationDesigning a True Direct-Access File System with DevFS
Designing a True Direct-Access File System with DevFS Sudarsun Kannan, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau University of Wisconsin-Madison Yuangang Wang, Jun Xu, Gopinath Palani Huawei Technologies
More informationPORTING CP2K TO THE INTEL XEON PHI. ARCHER Technical Forum, Wed 30 th July Iain Bethune
PORTING CP2K TO THE INTEL XEON PHI ARCHER Technical Forum, Wed 30 th July Iain Bethune (ibethune@epcc.ed.ac.uk) Outline Xeon Phi Overview Porting CP2K to Xeon Phi Performance Results Lessons Learned Further
More informationDeveloping deterministic networking technology for railway applications using TTEthernet software-based end systems
Developing deterministic networking technology for railway applications using TTEthernet software-based end systems Project n 100021 Astrit Ademaj, TTTech Computertechnik AG Outline GENESYS requirements
More informationGPUnet: Networking Abstractions for GPU Programs. Author: Andrzej Jackowski
Author: Andrzej Jackowski 1 Author: Andrzej Jackowski 2 GPU programming problem 3 GPU distributed application flow 1. recv req Network 4. send repl 2. exec on GPU CPU & Memory 3. get results GPU & Memory
More informationFlexible Architecture Research Machine (FARM)
Flexible Architecture Research Machine (FARM) RAMP Retreat June 25, 2009 Jared Casper, Tayo Oguntebi, Sungpack Hong, Nathan Bronson Christos Kozyrakis, Kunle Olukotun Motivation Why CPUs + FPGAs make sense
More informationGPUs and Emerging Architectures
GPUs and Emerging Architectures Mike Giles mike.giles@maths.ox.ac.uk Mathematical Institute, Oxford University e-infrastructure South Consortium Oxford e-research Centre Emerging Architectures p. 1 CPUs
More informationEXPERIENCES WITH NVME OVER FABRICS
13th ANNUAL WORKSHOP 2017 EXPERIENCES WITH NVME OVER FABRICS Parav Pandit, Oren Duer, Max Gurtovoy Mellanox Technologies [ 31 March, 2017 ] BACKGROUND: NVME TECHNOLOGY Optimized for flash and next-gen
More informationETHOS A Generic Ethernet over Sockets Driver for Linux
ETHOS A Generic Ethernet over Driver for Linux Parallel and Distributed Computing and Systems Rainer Finocchiaro Tuesday November 18 2008 CHAIR FOR OPERATING SYSTEMS Outline Motivation Architecture of
More informationp2pmem: Enabling PCIe Peer-2-Peer in Linux Stephen Bates, PhD Raithlin Consulting
p2pmem: Enabling PCIe Peer-2-Peer in Linux Stephen Bates, PhD Raithlin Consulting 1 Nomenclature: A Reminder PCIe Peer-2Peer using p2pmem is NOT blucky!! 2 The Rationale: A Reminder 3 The Rationale: A
More informationDatabase Architecture 2 & Storage. Instructor: Matei Zaharia cs245.stanford.edu
Database Architecture 2 & Storage Instructor: Matei Zaharia cs245.stanford.edu Summary from Last Time System R mostly matched the architecture of a modern RDBMS» SQL» Many storage & access methods» Cost-based
More informationLITE Kernel RDMA. Support for Datacenter Applications. Shin-Yeh Tsai, Yiying Zhang
LITE Kernel RDMA Support for Datacenter Applications Shin-Yeh Tsai, Yiying Zhang Time 2 Berkeley Socket Userspace Kernel Hardware Time 1983 2 Berkeley Socket TCP Offload engine Arrakis & mtcp IX Userspace
More informationThe rcuda middleware and applications
The rcuda middleware and applications Will my application work with rcuda? rcuda currently provides binary compatibility with CUDA 5.0, virtualizing the entire Runtime API except for the graphics functions,
More informationUnblinding the OS to Optimize User-Perceived Flash SSD Latency
Unblinding the OS to Optimize User-Perceived Flash SSD Latency Woong Shin *, Jaehyun Park **, Heon Y. Yeom * * Seoul National University ** Arizona State University USENIX HotStorage 2016 Jun. 21, 2016
More informationBroadberry. Artificial Intelligence Server for Fraud. Date: Q Application: Artificial Intelligence
TM Artificial Intelligence Server for Fraud Date: Q2 2017 Application: Artificial Intelligence Tags: Artificial intelligence, GPU, GTX 1080 TI HM Revenue & Customs The UK s tax, payments and customs authority
More informationCSC501 Operating Systems Principles. OS Structure
CSC501 Operating Systems Principles OS Structure 1 Announcements q TA s office hour has changed Q Thursday 1:30pm 3:00pm, MRC-409C Q Or email: awang@ncsu.edu q From department: No audit allowed 2 Last
More informationNo Tradeoff Low Latency + High Efficiency
No Tradeoff Low Latency + High Efficiency Christos Kozyrakis http://mast.stanford.edu Latency-critical Applications A growing class of online workloads Search, social networking, software-as-service (SaaS),
More informationSoftware and Tools for HPE s The Machine Project
Labs Software and Tools for HPE s The Machine Project Scalable Tools Workshop Aug/1 - Aug/4, 2016 Lake Tahoe Milind Chabbi Traditional Computing Paradigm CPU DRAM CPU DRAM CPU-centric computing 2 CPU-Centric
More informationWormTerminator: : An Effective Containment of Unknown and Polymorphic Fast Spreading Worms
WormTerminator: : An Effective Containment of Unknown and Polymorphic Fast Spreading Worms Songqing Chen, Xinyuan Wang, Lei Liu George Mason University, VA Xinwen Zhang Samsung Computer Science Lab, CA
More informationSmartNICs: Giving Rise To Smarter Offload at The Edge and In The Data Center
SmartNICs: Giving Rise To Smarter Offload at The Edge and In The Data Center Jeff Defilippi Senior Product Manager Arm #Arm Tech Symposia The Cloud to Edge Infrastructure Foundation for a World of 1T Intelligent
More informationGeneral Purpose GPU Programming (1) Advanced Operating Systems Lecture 14
General Purpose GPU Programming (1) Advanced Operating Systems Lecture 14 Lecture Outline Heterogenous multi-core systems and general purpose GPU programming Programming models Heterogenous multi-kernels
More informationGPUfs: Integrating a file system with GPUs
GPUfs: Integrating a file system with GPUs Mark Silberstein (UT Austin/Technion) Bryan Ford (Yale), Idit Keidar (Technion) Emmett Witchel (UT Austin) 1 Building systems with GPUs is hard. Why? 2 Goal of
More informationChelsio Communications. Meeting Today s Datacenter Challenges. Produced by Tabor Custom Publishing in conjunction with: CUSTOM PUBLISHING
Meeting Today s Datacenter Challenges Produced by Tabor Custom Publishing in conjunction with: 1 Introduction In this era of Big Data, today s HPC systems are faced with unprecedented growth in the complexity
More informationPocket: Elastic Ephemeral Storage for Serverless Analytics
Pocket: Elastic Ephemeral Storage for Serverless Analytics Ana Klimovic*, Yawen Wang*, Patrick Stuedi +, Animesh Trivedi +, Jonas Pfefferle +, Christos Kozyrakis* *Stanford University, + IBM Research 1
More informationM 3 Microkernel-based System for Heterogeneous Manycores
M 3 Microkernel-based System for Heterogeneous Manycores Nils Asmussen MKC, 06/29/2017 1 / 48 Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation Heterogeneous
More informationHigh-Performance Transaction Processing in Journaling File Systems Y. Son, S. Kim, H. Y. Yeom, and H. Han
High-Performance Transaction Processing in Journaling File Systems Y. Son, S. Kim, H. Y. Yeom, and H. Han Seoul National University, Korea Dongduk Women s University, Korea Contents Motivation and Background
More informationData Processing at the Speed of 100 Gbps using Apache Crail. Patrick Stuedi IBM Research
Data Processing at the Speed of 100 Gbps using Apache Crail Patrick Stuedi IBM Research The CRAIL Project: Overview Data Processing Framework (e.g., Spark, TensorFlow, λ Compute) Spark-IO FS Albis Streaming
More informationAn FPGA-Based Optical IOH Architecture for Embedded System
An FPGA-Based Optical IOH Architecture for Embedded System Saravana.S Assistant Professor, Bharath University, Chennai 600073, India Abstract Data traffic has tremendously increased and is still increasing
More informationThe Why and How of Developing All-Flash Storage Server
The Why and How of Developing All-Flash Storage Server June 2016 Jungsoo Kim Manager, SK Telecom Agenda Why we care about All-Flash Storage Transforming to 5G Network Open HW & SW Projects @ SKT Our approaches
More informationGPUs have enormous power that is enormously difficult to use
524 GPUs GPUs have enormous power that is enormously difficult to use Nvidia GP100-5.3TFlops of double precision This is equivalent to the fastest super computer in the world in 2001; put a single rack
More informationAdvanced Memory Organizations
CSE 3421: Introduction to Computer Architecture Advanced Memory Organizations Study: 5.1, 5.2, 5.3, 5.4 (only parts) Gojko Babić 03-29-2018 1 Growth in Performance of DRAM & CPU Huge mismatch between CPU
More informationMoneta: A High-performance Storage Array Architecture for Nextgeneration, Micro 2010
Moneta: A High-performance Storage Array Architecture for Nextgeneration, Non-volatile Memories Micro 2010 NVM-based SSD NVMs are replacing spinning-disks Performance of disks has lagged NAND flash showed
More informationThe SNIA NVM Programming Model. #OFADevWorkshop
The SNIA NVM Programming Model #OFADevWorkshop Opportunities with Next Generation NVM NVMe & STA SNIA 2 NVM Express/SCSI Express: Optimized storage interconnect & driver SNIA NVM Programming TWG: Optimized
More informationOpen Fabrics Workshop 2013
Open Fabrics Workshop 2013 OFS Software for the Intel Xeon Phi Bob Woodruff Agenda Intel Coprocessor Communication Link (CCL) Software IBSCIF RDMA from Host to Intel Xeon Phi Direct HCA Access from Intel
More informationInitial Evaluation of a User-Level Device Driver Framework
Initial Evaluation of a User-Level Device Driver Framework Stefan Götz Karlsruhe University Germany sgoetz@ira.uka.de Kevin Elphinstone National ICT Australia University of New South Wales kevine@cse.unsw.edu.au
More informationFinite Element Integration and Assembly on Modern Multi and Many-core Processors
Finite Element Integration and Assembly on Modern Multi and Many-core Processors Krzysztof Banaś, Jan Bielański, Kazimierz Chłoń AGH University of Science and Technology, Mickiewicza 30, 30-059 Kraków,
More informationApplication Access to Persistent Memory The State of the Nation(s)!
Application Access to Persistent Memory The State of the Nation(s)! Stephen Bates, Paul Grun, Tom Talpey, Doug Voigt Microsemi, Cray, Microsoft, HPE The Suspects Stephen Bates Microsemi Paul Grun Cray
More informationTile Processor (TILEPro64)
Tile Processor Case Study of Contemporary Multicore Fall 2010 Agarwal 6.173 1 Tile Processor (TILEPro64) Performance # of cores On-chip cache (MB) Cache coherency Operations (16/32-bit BOPS) On chip bandwidth
More information2 nd Half. Memory management Disk management Network and Security Virtual machine
Final Review 1 2 nd Half Memory management Disk management Network and Security Virtual machine 2 Abstraction Virtual Memory (VM) 4GB (32bit) linear address space for each process Reality 1GB of actual
More informationXen Network I/O Performance Analysis and Opportunities for Improvement
Xen Network I/O Performance Analysis and Opportunities for Improvement J. Renato Santos G. (John) Janakiraman Yoshio Turner HP Labs Xen Summit April 17-18, 27 23 Hewlett-Packard Development Company, L.P.
More informationOnyx: A Prototype Phase-Change Memory Storage Array
Onyx: A Prototype Phase-Change Memory Storage Array Ameen Akel * Adrian Caulfield, Todor Mollov, Rajesh Gupta, Steven Swanson Non-Volatile Systems Laboratory, Department of Computer Science and Engineering
More informationWhen MPPDB Meets GPU:
When MPPDB Meets GPU: An Extendible Framework for Acceleration Laura Chen, Le Cai, Yongyan Wang Background: Heterogeneous Computing Hardware Trend stops growing with Moore s Law Fast development of GPU
More informationApplication Advantages of NVMe over Fabrics RDMA and Fibre Channel
Application Advantages of NVMe over Fabrics RDMA and Fibre Channel Brandon Hoff Broadcom Limited Tuesday, June 14 2016 10:55 11:35 a.m. Agenda r Applications that have a need for speed r The Benefits of
More information