SPIN: Seamless Operating System Integration of Peer-to-Peer DMA Between SSDs and GPUs. Shai Bergman Tanya Brokhman Tzachi Cohen Mark Silberstein

Size: px
Start display at page:

Download "SPIN: Seamless Operating System Integration of Peer-to-Peer DMA Between SSDs and GPUs. Shai Bergman Tanya Brokhman Tzachi Cohen Mark Silberstein"

Transcription

1 : Seamless Operating System Integration of Peer-to-Peer DMA Between SSDs and s Shai Bergman Tanya Brokhman Tzachi Cohen Mark Silberstein

2 What do we do? Enable efficient file I/O for s Why? Support diverse I/O workloads involving s How? Make P2P a first class citizen within the file I/O stack Results Better throughput Standard file API cross- portability 2

3 Fast data transfers CPU mediated data Data resides in SSD transfers introduce extra latency with lower Bounded by extra copy throughput PCIe CPUIO - CPU mediated transfer 3

4 vendors support P2P & PCIe Eliminates redundant copies Not involved in data path Better power efficiency Higher throughput & lower latency [1] J. Zhang, D. Donofrio, J. Shalf, M. T. Kandemir, and M. Jung, NVMMU: A Non-volatile Memory Management Unit for Heterogeneous -SSD Architectures, [2] H.-W. Tseng, Y. Liu, M. Gahagan, J. Li, Y. Jin, and S. Swanson, Gullfoss: Accelerating and Simplifying Data Movement Among Heterogeneous Computing and Storage Resources, [3] M. Shihab, K. Taht, and M. Jung, Drive: Reconsidering Storage Accesses for Acceleration, [4] H.-W. Tseng, Q. Zhao, Y. Zhou, M. Gahagan, and S. Swanson, Morpheus: creating application objects efficiently for heterogeneous computing, 4 [5] Project Donard

5 Speedup P2P is great!* * But wait only for certain data access patterns Sequential reads (e.g. grep) x 7.6x 4.2x 512B 53 4K k 472 CPUIO - CPU mediated transfer 32k 64K 512K 1M Block size P2P is not a silver bullet! CPUIO P2P P2P is not a silver bullet Short Large sequential sequential reads: reads: P2P ~33x P2P ~1.4x Slower Speedup than CPUIO? **data is not preloaded to the page cache 5

6 What went wrong? P2P bypasses the kernel! No Page Cache Integration No read ahead Cannot utilize P$ for data reuse Hard to utilize Non-standard API No misaligned accesses LVM/MDADM incompatible No file consistency Can read stale data Requires explicit flushes to SSD 6

7 What do we want? Regular file I/O to memory int fd;... //open file... pread64(fd,gpu_buffer,20*1024,0);... 7

8 : Contributions Standard File API Underlying block device support (RAID, LVM) Activate P2P when beneficial Data Consistency + POSIX file semantics Keep POSIX file semantics + data consistency, even when CPU + work on the same file Combine Page Cache and P2P Interleave system memory and SSD when possible Read Ahead Activate read ahead mechanism when determined beneficial. Nested page cache within CPU memory for Use 8

9 P2PDMA transfer From Compatible SSD? : Destined to? High level Part of a sequential read? Data resides in page cache? pread( fd, buffer ) RQ buffer overview P-Read Ahead policy 1 P-router 2.a 2.b P-cache checker PCIe P2PDMA VFS Page cache 5.b 3.a CPU PC RA 3.b Block Layer NVMe Driver addr. extraction Current FS API 4.a 4.b file 5.a NVMe SSD 9

10 P2PDMA transfer : High level pread( fd, buffer ) buffer overview P-Read Ahead policy 1 RQ P-router 2.a 2.b P-cache checker PCIe P2PDMA VFS Page cache 5.b 3.a CPU PC RA 3.b Block Layer NVMe Driver addr. extraction Current FS API 4.a 4.b file 5.a NVMe SSD 10

11 P2PDMA transfer : High level pread( fd, buffer ) buffer overview P-Read Ahead policy 1 P-router 2.a 2.b P-cache checker PCIe VFS P2PDMA RQ Page cache 5.b 3.a CPU PC RA 3.b Block Layer NVMe Driver addr. extraction Current FS API 4.a 4.b file 5.a NVMe SSD 11

12 P2PDMA transfer : High level pread( fd, buffer ) buffer overview P-Read Ahead policy 1 RQ P-router 2.a 2.b P-cache checker PCIe P2PDMA VFS Page cache 5.b 3.a CPU PC RA 3.b Block Layer NVMe Driver addr. extraction Current FS API 4.a 4.b file 5.a NVMe SSD 12

13 P2PDMA transfer : High level pread( fd, buffer ) buffer overview P-Read Ahead policy RQ 1 P-router 2.a 2.b P-cache checker PCIe P2PDMA VFS Page cache 5.b 3.a CPU PC RA 3.b Block Layer NVMe Driver addr. extraction Current FS API 4.a 4.b file 5.a NVMe SSD 13

14 P2PDMA transfer : High level pread( fd, buffer ) buffer overview P-Read Ahead policy 1 RQ P-router 2.a 2.b P-cache checker PCIe P2PDMA VFS Page cache 5.b 3.a CPU PC RA 3.b Block Layer NVMe Driver addr. extraction Current FS API 4.a 4.b file 5.a NVMe SSD 14

15 P2PDMA transfer : High level pread( fd, buffer ) buffer overview P-Read Ahead policy 1 P-router 2.a 2.b P-cache checker PCIe P2PDMA VFS Page cache 5.b 3.a CPU PC RA 3.b Block Layer NVMe Driver addr. extraction Current FS API 4.a 4.b file 5.a NVMe SSD 15

16 Transfer Time [usec] : Combine page cache and P2P Sometimes the requested data resides in the P$ e.g due to previous usage of the data by CPU Transfer time from P$ and SSD P2P vs request size P2P P$ 500 Lower is better Request Size [KiB] Transferring data from P$ is faster! 16

17 : Combine page cache and P2P pread64(fd,gpu_dest,5*4096,0); //5 pages of 4KiB Page #0 Page #1 Page #2 Page #3 Page #4 Page cache is empty: Page Cache PCIe Memory Bus EMPTY Transfer P2P! P-cache checker 17

18 : Combine page cache and P2P pread64(fd,gpu_destk,5*4096,0); //5 pages of 4KiB Page #0 Page #1 Page #2 Page #3 Page #4 All pread64 contents in P$: Page Cache PCIe Memory Bus EMPTY Transfer from memory (CPUIO) P-cache checker 18

19 : Combine page cache and P2P pread64(fd,gpu_dest,5*4096,0); //5 pages of 4KiB Page #0 Page #1 Page #2 Page #3 Page #4 Page resides in SSD only Page resides in SSD and P$ Some of the data is in P$? PCIe Memory Bus Page Cache EMPTY #p1? #p3 #p0 #p2 #p4? Fine grained interleaving is a bad idea! 19

20 : Combine page cache and P2P Page resides in SSD only Page resides in SSD and P$ pread64(fd,gpu_dest,5*4096,0); //5 pages of 4KiB Page #0 Page #1 Page #2 Page #3 Page #4 40.1us 40.1us 40.1us 3 transfers of 4KiB via P2P: 120.3us Single transfer of 20KiB via P2P: 74.3us 20

21 : Fine grained interleaving = poor performance! Combine page cache and P2P SSDs: - Short IO requests are less efficient (low parallelism) - Invocation overhead per request Optimization Problem: Find the transfer schedule to minimize transfer time 21

22 Transfer Time [sec] : Combine page cache and P2P To solve the problem & get an optimal schedule we need: T s - P2P transfer time for a given request size p2p T s - P$ transfer time for a given request size P$ We model the SSD and RAM performance characteristics: - Assume P2P transfer time as piece-wise linear - Assume RAM transfer time as linear P2P 1500 P$ Request Size [KiB] 22

23 : Combine page Solution is polynomial in number of blocks Costly to calculate for every transfer cache and P2P We apply a greedy heuristic: - Examine every 3 consecutive data chunks Chunk #n Chunk #n+1 Chunk #n+2 Page resides in SSD only Page resides in SSD and P$ Calculate: T n + n n + 2 p2p vs. T p2p n + T P$ n T p2p n

24 : Combine page Solution is polynomial in number of blocks Costly to calculate for every transfer cache and P2P We apply a greedy heuristic: - Examine every 3 consecutive data chunks Greedy Heuristic is only Chunk #n Chunk #n+1 Chunk #n+2 1.6% slower than optimal Page resides in SSD only Page resides in SSD and P$ Calculate: scheduling T n + n n + 2 p2p vs. T p2p n + T P$ n T p2p n

25 P2PDMA transfer : pread( fd, buffer ) buffer Implementation: P2P & P$ P-Read Ahead policy 1 P-router 2.a 2.b P-cache checker PCIe P2P: Address tunneling Transfers P2PDMA mechanism VFS Page cache 5.b P$: Memcpy from P$ to mapped memory 3.a CPU PC RA 3.b Current FS API 5.a Block Layer NVMe Driver addr. extraction 4.a 4.b file NVMe SSD 25

26 : Implementation & Evaluation is implemented as a kernel module, patched NVME module & an LD_PRELOAD library No kernel modifications are required System Specs: - Intel P3700 NVME SSD - AMD Radeon R9 Fury & NVIDIA Tesla K40c - Ubuntu + Linux kernel Intel Core i7-5930k (6 Phys Cores) & X99 Chipset - 24GB DDR4 RAM 26

27 : Implementation & Evaluation We have evaluated the following: Threaded IO (TIOtest) Benchmark (1-4 threads): - Sequential reads (including software RAID) - Random reads/writes - Effects of P$ residency on read throughput - Effects CPU & I/O stress on read throughput Application Benchmarks - Aerial imagery rendering - accelerated log server - Image collage utilizing FS 27

28 Relative throughput % : Implementation & Evaluation Effect of P$ on Read Throughput Potential performance gains for producer-consumer workloads No data in P$, less than 5% overhead P2PDMA CPUIO All data in P$, less than 5% overhead Thpt [MB/sec]: % of file in page cache *512B reads 28

29 : Implementation & Evaluation - Store a log into SSD Accelerated Log Server - Analyze log using acceleration for string matching - Similar to fail2ban Real time configuration: - Log arrives to server - Server stores logs in SSD - analyzes logs by reading file Offline configuration: - Log is already in SSD - analyzes logs by reading file 29

30 : Implementation & Evaluation - Store a log into SSD Accelerated Log Server - Analyze log using acceleration for string matching - Similar to fail2ban Real time configuration: - Log arrives to server - Server stores logs in SSD We want our application to - analyzes logs by reading file work efficiently in any configuration Offline configuration: - Log is already in SSD - analyzes logs by reading file 30

31 BW [MBPS] I/O Contention BW [MBPS] Additional hop : Accelerated Log Server Implementation & Evaluation Real Time configuration Offline configuration P2P CPUIO Transfer Mechanism Data resides in p$ and SSD reads data from P$ 0 P2P CPUIO Transfer Mechanism Data resides in SSD only utilizes P2P 31

32 : - seamlessly integrates P2P as a first class citizen into the file I/O stack - utilizes several mechanisms to speed up data transfers transparently - With, the same code performs well under all setups Thank you! shaiberg1@tx.technion.ac.il github.com/acsl-technion/spin 32

GPUfs: Integrating a file system with GPUs

GPUfs: Integrating a file system with GPUs GPUfs: Integrating a file system with GPUs Mark Silberstein (UT Austin/Technion) Bryan Ford (Yale), Idit Keidar (Technion) Emmett Witchel (UT Austin) 1 Building systems with GPUs is hard. Why? 2 Goal of

More information

GPUfs: Integrating a file system with GPUs

GPUfs: Integrating a file system with GPUs GPUfs: Integrating a file system with GPUs Mark Silberstein (UT Austin/Technion) Bryan Ford (Yale), Idit Keidar (Technion) Emmett Witchel (UT Austin) 1 Traditional System Architecture Applications OS CPU

More information

Important new NVMe features for optimizing the data pipeline

Important new NVMe features for optimizing the data pipeline Important new NVMe features for optimizing the data pipeline Dr. Stephen Bates, CTO Eideticom Santa Clara, CA 1 Outline Intro to NVMe Controller Memory Buffers (CMBs) Use cases for CMBs Submission Queue

More information

Linux Storage System Bottleneck Exploration

Linux Storage System Bottleneck Exploration Linux Storage System Bottleneck Exploration Bean Huo / Zoltan Szubbocsev Beanhuo@micron.com / zszubbocsev@micron.com 215 Micron Technology, Inc. All rights reserved. Information, products, and/or specifications

More information

An NVMe-based Offload Engine for Storage Acceleration Sean Gibb, Eideticom Stephen Bates, Raithlin

An NVMe-based Offload Engine for Storage Acceleration Sean Gibb, Eideticom Stephen Bates, Raithlin An NVMe-based Offload Engine for Storage Acceleration Sean Gibb, Eideticom Stephen Bates, Raithlin 1 Overview Acceleration for Storage NVMe for Acceleration How are we using (abusing ;-)) NVMe to support

More information

Morpheus: Creating Application Objects Efficiently for Heterogeneous Computing

Morpheus: Creating Application Objects Efficiently for Heterogeneous Computing 216 ACM/IEEE 43rd Aual International Symposium on Computer Architecture Morpheus: Creating Application Objects Efficiently for Heterogeneous Computing Hung-Wei Tseng, Qianchen Zhao, Yuxiao Zhou, Mark Gahagan,

More information

Morpheus: Creating Application Objects Efficiently for Heterogeneous Computing

Morpheus: Creating Application Objects Efficiently for Heterogeneous Computing Morpheus: Creating Application Objects Efficiently for Heterogeneous Computing Hung-Wei Tseng, Qianchen Zhao, Yuxiao Zhou, Mark Gahagan, Steven Swanson Department of Computer Science and Engineering University

More information

GPUfs: Integrating a file system with GPUs

GPUfs: Integrating a file system with GPUs ASPLOS 2013 GPUfs: Integrating a file system with GPUs Mark Silberstein (UT Austin/Technion) Bryan Ford (Yale), Idit Keidar (Technion) Emmett Witchel (UT Austin) 1 Traditional System Architecture Applications

More information

Falcon: Scaling IO Performance in Multi-SSD Volumes. The George Washington University

Falcon: Scaling IO Performance in Multi-SSD Volumes. The George Washington University Falcon: Scaling IO Performance in Multi-SSD Volumes Pradeep Kumar H Howie Huang The George Washington University SSDs in Big Data Applications Recent trends advocate using many SSDs for higher throughput

More information

Comparing UFS and NVMe Storage Stack and System-Level Performance in Embedded Systems

Comparing UFS and NVMe Storage Stack and System-Level Performance in Embedded Systems Comparing UFS and NVMe Storage Stack and System-Level Performance in Embedded Systems Bean Huo, Blair Pan, Peter Pan, Zoltan Szubbocsev Micron Technology Introduction Embedded storage systems have experienced

More information

Ben Walker Data Center Group Intel Corporation

Ben Walker Data Center Group Intel Corporation Ben Walker Data Center Group Intel Corporation Notices and Disclaimers Intel technologies features and benefits depend on system configuration and may require enabled hardware, software or service activation.

More information

Accelerating Data Centers Using NVMe and CUDA

Accelerating Data Centers Using NVMe and CUDA Accelerating Data Centers Using NVMe and CUDA Stephen Bates, PhD Technical Director, CSTO, PMC-Sierra Santa Clara, CA 1 Project Donard @ PMC-Sierra Donard is a PMC CTO project that leverages NVM Express

More information

Linux Storage System Analysis for e.mmc With Command Queuing

Linux Storage System Analysis for e.mmc With Command Queuing Linux Storage System Analysis for e.mmc With Command Queuing Linux is a widely used embedded OS that also manages block devices such as e.mmc, UFS and SSD. Traditionally, advanced embedded systems have

More information

Enabling the NVMe CMB and PMR Ecosystem

Enabling the NVMe CMB and PMR Ecosystem Architected for Performance Enabling the NVMe CMB and PMR Ecosystem Stephen Bates, PhD. CTO, Eideticom Oren Duer. Software Architect, Mellanox NVM Express Developers Day May 1, 2018 Outline 1. Intro to

More information

NVMFS: A New File System Designed Specifically to Take Advantage of Nonvolatile Memory

NVMFS: A New File System Designed Specifically to Take Advantage of Nonvolatile Memory NVMFS: A New File System Designed Specifically to Take Advantage of Nonvolatile Memory Dhananjoy Das, Sr. Systems Architect SanDisk Corp. 1 Agenda: Applications are KING! Storage landscape (Flash / NVM)

More information

Modern Processor Architectures. L25: Modern Compiler Design

Modern Processor Architectures. L25: Modern Compiler Design Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions

More information

RAIN: Reinvention of RAID for the World of NVMe

RAIN: Reinvention of RAID for the World of NVMe RAIN: Reinvention of RAID for the World of NVMe Dmitrii Smirnov Principal Software Developer smirnov.d@raidix.com RAIDIX LLC 1 About the company RAIDIX is an innovative solution provider and developer

More information

Application Access to Persistent Memory The State of the Nation(s)!

Application Access to Persistent Memory The State of the Nation(s)! Application Access to Persistent Memory The State of the Nation(s)! Stephen Bates, Paul Grun, Tom Talpey, Doug Voigt Microsemi, Cray, Microsoft, HPE The Suspects Stephen Bates Microsemi Paul Grun Cray

More information

Designing Next Generation FS for NVMe and NVMe-oF

Designing Next Generation FS for NVMe and NVMe-oF Designing Next Generation FS for NVMe and NVMe-oF Liran Zvibel CTO, Co-founder Weka.IO @liranzvibel Santa Clara, CA 1 Designing Next Generation FS for NVMe and NVMe-oF Liran Zvibel CTO, Co-founder Weka.IO

More information

An NVMe-based FPGA Storage Workload Accelerator

An NVMe-based FPGA Storage Workload Accelerator An NVMe-based FPGA Storage Workload Accelerator Dr. Sean Gibb, VP Software Eideticom Santa Clara, CA 1 PCIe Bus NVMe SSD NVMe SSD Acceleration Host CPU HDD RDMA NIC NoLoad Accel. Card TM Storage I/O Bandwidth

More information

AN 831: Intel FPGA SDK for OpenCL

AN 831: Intel FPGA SDK for OpenCL AN 831: Intel FPGA SDK for OpenCL Host Pipelined Multithread Subscribe Send Feedback Latest document on the web: PDF HTML Contents Contents 1 Intel FPGA SDK for OpenCL Host Pipelined Multithread...3 1.1

More information

Morpheus: Exploring the Potential of Near-Data Processing for Creating Application Objects in Heterogeneous Computing

Morpheus: Exploring the Potential of Near-Data Processing for Creating Application Objects in Heterogeneous Computing Morpheus: Exploring the Potential of Near-Data Processing for Creating Application Objects in Heterogeneous Computing Hung-Wei Tseng North Carolina State University Qianchen Zhao Arista Networks Yuxiao

More information

Arrakis: The Operating System is the Control Plane

Arrakis: The Operating System is the Control Plane Arrakis: The Operating System is the Control Plane Simon Peter, Jialin Li, Irene Zhang, Dan Ports, Doug Woos, Arvind Krishnamurthy, Tom Anderson University of Washington Timothy Roscoe ETH Zurich Building

More information

REMOTE PERSISTENT MEMORY ACCESS WORKLOAD SCENARIOS AND RDMA SEMANTICS

REMOTE PERSISTENT MEMORY ACCESS WORKLOAD SCENARIOS AND RDMA SEMANTICS 13th ANNUAL WORKSHOP 2017 REMOTE PERSISTENT MEMORY ACCESS WORKLOAD SCENARIOS AND RDMA SEMANTICS Tom Talpey Microsoft [ March 31, 2017 ] OUTLINE Windows Persistent Memory Support A brief summary, for better

More information

Towards Weakly Consistent Local Storage Systems

Towards Weakly Consistent Local Storage Systems Towards Weakly Consistent Local Storage Systems Ji-Yong Shin 1,2, Mahesh Balakrishnan 2, Tudor Marian 3, Jakub Szefer 2, Hakim Weatherspoon 1 1 Cornell University, 2 Yale University, 3 Google 2 Consistency/Performance

More information

IME (Infinite Memory Engine) Extreme Application Acceleration & Highly Efficient I/O Provisioning

IME (Infinite Memory Engine) Extreme Application Acceleration & Highly Efficient I/O Provisioning IME (Infinite Memory Engine) Extreme Application Acceleration & Highly Efficient I/O Provisioning September 22 nd 2015 Tommaso Cecchi 2 What is IME? This breakthrough, software defined storage application

More information

Pactron FPGA Accelerated Computing Solutions

Pactron FPGA Accelerated Computing Solutions Pactron FPGA Accelerated Computing Solutions Intel Xeon + Altera FPGA 2015 Pactron HJPC Corporation 1 Motivation for Accelerators Enhanced Performance: Accelerators compliment CPU cores to meet market

More information

WHITE PAPER SINGLE & MULTI CORE PERFORMANCE OF AN ERASURE CODING WORKLOAD ON AMD EPYC

WHITE PAPER SINGLE & MULTI CORE PERFORMANCE OF AN ERASURE CODING WORKLOAD ON AMD EPYC WHITE PAPER SINGLE & MULTI CORE PERFORMANCE OF AN ERASURE CODING WORKLOAD ON AMD EPYC INTRODUCTION With the EPYC processor line, AMD is expected to take a strong position in the server market including

More information

CS370 Operating Systems

CS370 Operating Systems CS370 Operating Systems Colorado State University Yashwant K Malaiya Spring 2018 Lecture 22 File Systems Slides based on Text by Silberschatz, Galvin, Gagne Various sources 1 1 Disk Structure Disk can

More information

Performance and Optimization Issues in Multicore Computing

Performance and Optimization Issues in Multicore Computing Performance and Optimization Issues in Multicore Computing Minsoo Ryu Department of Computer Science and Engineering 2 Multicore Computing Challenges It is not easy to develop an efficient multicore program

More information

RAIN: Reinvention of RAID for the World of NVMe. Sergey Platonov RAIDIX

RAIN: Reinvention of RAID for the World of NVMe. Sergey Platonov RAIDIX RAIN: Reinvention of RAID for the World of NVMe Sergey Platonov RAIDIX 1 NVMe Market Overview > 15 vendors develop NVMe-compliant servers and appliances > 50% of servers will have NVMe slots by 2020 Market

More information

Accelerating Storage with NVM Express SSDs and P2PDMA Stephen Bates, PhD Chief Technology Officer

Accelerating Storage with NVM Express SSDs and P2PDMA Stephen Bates, PhD Chief Technology Officer Accelerating Storage with NVM Express SSDs and P2PDMA Stephen Bates, PhD Chief Technology Officer 2018 Storage Developer Conference. Eidetic Communications Inc. All Rights Reserved. 1 Outline Motivation

More information

Moneta: A High-performance Storage Array Architecture for Nextgeneration, Micro 2010

Moneta: A High-performance Storage Array Architecture for Nextgeneration, Micro 2010 Moneta: A High-performance Storage Array Architecture for Nextgeneration, Non-volatile Memories Micro 2010 NVM-based SSD NVMs are replacing spinning-disks Performance of disks has lagged NAND flash showed

More information

ZEST Snapshot Service. A Highly Parallel Production File System by the PSC Advanced Systems Group Pittsburgh Supercomputing Center 1

ZEST Snapshot Service. A Highly Parallel Production File System by the PSC Advanced Systems Group Pittsburgh Supercomputing Center 1 ZEST Snapshot Service A Highly Parallel Production File System by the PSC Advanced Systems Group Pittsburgh Supercomputing Center 1 Design Motivation To optimize science utilization of the machine Maximize

More information

PebblesDB: Building Key-Value Stores using Fragmented Log Structured Merge Trees

PebblesDB: Building Key-Value Stores using Fragmented Log Structured Merge Trees PebblesDB: Building Key-Value Stores using Fragmented Log Structured Merge Trees Pandian Raju 1, Rohan Kadekodi 1, Vijay Chidambaram 1,2, Ittai Abraham 2 1 The University of Texas at Austin 2 VMware Research

More information

Much Faster Networking

Much Faster Networking Much Faster Networking David Riddoch driddoch@solarflare.com Copyright 2016 Solarflare Communications, Inc. All rights reserved. What is kernel bypass? The standard receive path The standard receive path

More information

Introduction to Multicore Programming

Introduction to Multicore Programming Introduction to Multicore Programming Minsoo Ryu Department of Computer Science and Engineering 2 1 Multithreaded Programming 2 Synchronization 3 Automatic Parallelization and OpenMP 4 GPGPU 5 Q& A 2 Multithreaded

More information

Linux Software RAID Level 0 Technique for High Performance Computing by using PCI-Express based SSD

Linux Software RAID Level 0 Technique for High Performance Computing by using PCI-Express based SSD Linux Software RAID Level Technique for High Performance Computing by using PCI-Express based SSD Jae Gi Son, Taegyeong Kim, Kuk Jin Jang, *Hyedong Jung Department of Industrial Convergence, Korea Electronics

More information

I/O, Disks, and RAID Yi Shi Fall Xi an Jiaotong University

I/O, Disks, and RAID Yi Shi Fall Xi an Jiaotong University I/O, Disks, and RAID Yi Shi Fall 2017 Xi an Jiaotong University Goals for Today Disks How does a computer system permanently store data? RAID How to make storage both efficient and reliable? 2 What does

More information

Exploring System Challenges of Ultra-Low Latency Solid State Drives

Exploring System Challenges of Ultra-Low Latency Solid State Drives Exploring System Challenges of Ultra-Low Latency Solid State Drives Sungjoon Koh Changrim Lee, Miryeong Kwon, and Myoungsoo Jung Computer Architecture and Memory systems Lab Executive Summary Motivation.

More information

Martin Dubois, ing. Contents

Martin Dubois, ing. Contents Martin Dubois, ing Contents Without OpenNet vs With OpenNet Technical information Possible applications Artificial Intelligence Deep Packet Inspection Image and Video processing Network equipment development

More information

COS 318: Operating Systems. NSF, Snapshot, Dedup and Review

COS 318: Operating Systems. NSF, Snapshot, Dedup and Review COS 318: Operating Systems NSF, Snapshot, Dedup and Review Topics! NFS! Case Study: NetApp File System! Deduplication storage system! Course review 2 Network File System! Sun introduced NFS v2 in early

More information

Recent Advances in Heterogeneous Computing using Charm++

Recent Advances in Heterogeneous Computing using Charm++ Recent Advances in Heterogeneous Computing using Charm++ Jaemin Choi, Michael Robson Parallel Programming Laboratory University of Illinois Urbana-Champaign April 12, 2018 1 / 24 Heterogeneous Computing

More information

NVMe SSDs with Persistent Memory Regions

NVMe SSDs with Persistent Memory Regions NVMe SSDs with Persistent Memory Regions Chander Chadha Sr. Manager Product Marketing, Toshiba Memory America, Inc. 2018 Toshiba Memory America, Inc. August 2018 1 Agenda q Why Persistent Memory is needed

More information

Chapter-6. SUBJECT:- Operating System TOPICS:- I/O Management. Created by : - Sanjay Patel

Chapter-6. SUBJECT:- Operating System TOPICS:- I/O Management. Created by : - Sanjay Patel Chapter-6 SUBJECT:- Operating System TOPICS:- I/O Management Created by : - Sanjay Patel Disk Scheduling Algorithm 1) First-In-First-Out (FIFO) 2) Shortest Service Time First (SSTF) 3) SCAN 4) Circular-SCAN

More information

EE 457 Unit 7b. Main Memory Organization

EE 457 Unit 7b. Main Memory Organization 1 EE 457 Unit 7b Main Memory Organization 2 Motivation Organize main memory to Facilitate byte-addressability while maintaining Efficient fetching of the words in a cache block Low order interleaving (L.O.I)

More information

Addressing the Memory Wall

Addressing the Memory Wall Lecture 26: Addressing the Memory Wall Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Cage the Elephant Back Against the Wall (Cage the Elephant) This song is for the

More information

CS5460: Operating Systems Lecture 20: File System Reliability

CS5460: Operating Systems Lecture 20: File System Reliability CS5460: Operating Systems Lecture 20: File System Reliability File System Optimizations Modern Historic Technique Disk buffer cache Aggregated disk I/O Prefetching Disk head scheduling Disk interleaving

More information

CSCI 350 Ch. 12 Storage Device. Mark Redekopp Michael Shindler & Ramesh Govindan

CSCI 350 Ch. 12 Storage Device. Mark Redekopp Michael Shindler & Ramesh Govindan 1 CSCI 350 Ch. 12 Storage Device Mark Redekopp Michael Shindler & Ramesh Govindan 2 Introduction Storage HW limitations Poor randomaccess Asymmetric read/write performance Reliability issues File system

More information

THE STORAGE PERFORMANCE DEVELOPMENT KIT AND NVME-OF

THE STORAGE PERFORMANCE DEVELOPMENT KIT AND NVME-OF 14th ANNUAL WORKSHOP 2018 THE STORAGE PERFORMANCE DEVELOPMENT KIT AND NVME-OF Paul Luse Intel Corporation Apr 2018 AGENDA Storage Performance Development Kit What is SPDK? The SPDK Community Why are so

More information

High Performance Computing on GPUs using NVIDIA CUDA

High Performance Computing on GPUs using NVIDIA CUDA High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and

More information

Using Transparent Compression to Improve SSD-based I/O Caches

Using Transparent Compression to Improve SSD-based I/O Caches Using Transparent Compression to Improve SSD-based I/O Caches Thanos Makatos, Yannis Klonatos, Manolis Marazakis, Michail D. Flouris, and Angelos Bilas {mcatos,klonatos,maraz,flouris,bilas}@ics.forth.gr

More information

Leveraging Hybrid Hardware in New Ways: The GPU Paging Cache

Leveraging Hybrid Hardware in New Ways: The GPU Paging Cache Leveraging Hybrid Hardware in New Ways: The GPU Paging Cache Frank Feinbube, Peter Tröger, Johannes Henning, Andreas Polze Hasso Plattner Institute Operating Systems and Middleware Prof. Dr. Andreas Polze

More information

Co-synthesis and Accelerator based Embedded System Design

Co-synthesis and Accelerator based Embedded System Design Co-synthesis and Accelerator based Embedded System Design COE838: Embedded Computer System http://www.ee.ryerson.ca/~courses/coe838/ Dr. Gul N. Khan http://www.ee.ryerson.ca/~gnkhan Electrical and Computer

More information

Data Storage and Query Answering. Data Storage and Disk Structure (2)

Data Storage and Query Answering. Data Storage and Disk Structure (2) Data Storage and Query Answering Data Storage and Disk Structure (2) Review: The Memory Hierarchy Swapping, Main-memory DBMS s Tertiary Storage: Tape, Network Backup 3,200 MB/s (DDR-SDRAM @200MHz) 6,400

More information

CS6453. Data-Intensive Systems: Rachit Agarwal. Technology trends, Emerging challenges & opportuni=es

CS6453. Data-Intensive Systems: Rachit Agarwal. Technology trends, Emerging challenges & opportuni=es CS6453 Data-Intensive Systems: Technology trends, Emerging challenges & opportuni=es Rachit Agarwal Slides based on: many many discussions with Ion Stoica, his class, and many industry folks Servers Typical

More information

Moneta: A High-Performance Storage Architecture for Next-generation, Non-volatile Memories

Moneta: A High-Performance Storage Architecture for Next-generation, Non-volatile Memories Moneta: A High-Performance Storage Architecture for Next-generation, Non-volatile Memories Adrian M. Caulfield Arup De, Joel Coburn, Todor I. Mollov, Rajesh K. Gupta, Steven Swanson Non-Volatile Systems

More information

Structuring PLFS for Extensibility

Structuring PLFS for Extensibility Structuring PLFS for Extensibility Chuck Cranor, Milo Polte, Garth Gibson PARALLEL DATA LABORATORY Carnegie Mellon University What is PLFS? Parallel Log Structured File System Interposed filesystem b/w

More information

Red Fox: An Execution Environment for Relational Query Processing on GPUs

Red Fox: An Execution Environment for Relational Query Processing on GPUs Red Fox: An Execution Environment for Relational Query Processing on GPUs Haicheng Wu 1, Gregory Diamos 2, Tim Sheard 3, Molham Aref 4, Sean Baxter 2, Michael Garland 2, Sudhakar Yalamanchili 1 1. Georgia

More information

Workload Performance Comparison. Eden Kim, Calypso Systems, Inc.

Workload Performance Comparison. Eden Kim, Calypso Systems, Inc. Workload Performance Comparison Eden Kim, Calypso Systems, Inc. Agenda Part 1: Workload Comparisons: Part 2: NVMe SSD, & Mmap Part 3: DAX: Conclusions 2 Part 1: Workload Comparisons Synthetic Benchmark:

More information

Strata: A Cross Media File System. Youngjin Kwon, Henrique Fingler, Tyler Hunt, Simon Peter, Emmett Witchel, Thomas Anderson

Strata: A Cross Media File System. Youngjin Kwon, Henrique Fingler, Tyler Hunt, Simon Peter, Emmett Witchel, Thomas Anderson A Cross Media File System Youngjin Kwon, Henrique Fingler, Tyler Hunt, Simon Peter, Emmett Witchel, Thomas Anderson 1 Let s build a fast server NoSQL store, Database, File server, Mail server Requirements

More information

Computer Architecture Computer Science & Engineering. Chapter 6. Storage and Other I/O Topics BK TP.HCM

Computer Architecture Computer Science & Engineering. Chapter 6. Storage and Other I/O Topics BK TP.HCM Computer Architecture Computer Science & Engineering Chapter 6 Storage and Other I/O Topics Introduction I/O devices can be characterized by Behaviour: input, output, storage Partner: human or machine

More information

I/O Devices. Nima Honarmand (Based on slides by Prof. Andrea Arpaci-Dusseau)

I/O Devices. Nima Honarmand (Based on slides by Prof. Andrea Arpaci-Dusseau) I/O Devices Nima Honarmand (Based on slides by Prof. Andrea Arpaci-Dusseau) Hardware Support for I/O CPU RAM Network Card Graphics Card Memory Bus General I/O Bus (e.g., PCI) Canonical Device OS reads/writes

More information

Linux Device Drivers: Case Study of a Storage Controller. Manolis Marazakis FORTH-ICS (CARV)

Linux Device Drivers: Case Study of a Storage Controller. Manolis Marazakis FORTH-ICS (CARV) Linux Device Drivers: Case Study of a Storage Controller Manolis Marazakis FORTH-ICS (CARV) IOP348-based I/O Controller Programmable I/O Controller Continuous Data Protection: Versioning (snapshots), Migration,

More information

Multiprocessor Cache Coherence. Chapter 5. Memory System is Coherent If... From ILP to TLP. Enforcing Cache Coherence. Multiprocessor Types

Multiprocessor Cache Coherence. Chapter 5. Memory System is Coherent If... From ILP to TLP. Enforcing Cache Coherence. Multiprocessor Types Chapter 5 Multiprocessor Cache Coherence Thread-Level Parallelism 1: read 2: read 3: write??? 1 4 From ILP to TLP Memory System is Coherent If... ILP became inefficient in terms of Power consumption Silicon

More information

Computer Systems Laboratory Sungkyunkwan University

Computer Systems Laboratory Sungkyunkwan University I/O System Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Introduction (1) I/O devices can be characterized by Behavior: input, output, storage

More information

Open-Channel SSDs Offer the Flexibility Required by Hyperscale Infrastructure Matias Bjørling CNEX Labs

Open-Channel SSDs Offer the Flexibility Required by Hyperscale Infrastructure Matias Bjørling CNEX Labs Open-Channel SSDs Offer the Flexibility Required by Hyperscale Infrastructure Matias Bjørling CNEX Labs 1 Public and Private Cloud Providers 2 Workloads and Applications Multi-Tenancy Databases Instance

More information

An FPGA-Based Optical IOH Architecture for Embedded System

An FPGA-Based Optical IOH Architecture for Embedded System An FPGA-Based Optical IOH Architecture for Embedded System Saravana.S Assistant Professor, Bharath University, Chennai 600073, India Abstract Data traffic has tremendously increased and is still increasing

More information

Experiences with the Sparse Matrix-Vector Multiplication on a Many-core Processor

Experiences with the Sparse Matrix-Vector Multiplication on a Many-core Processor Experiences with the Sparse Matrix-Vector Multiplication on a Many-core Processor Juan C. Pichel Centro de Investigación en Tecnoloxías da Información (CITIUS) Universidade de Santiago de Compostela, Spain

More information

Computer Organization and Structure. Bing-Yu Chen National Taiwan University

Computer Organization and Structure. Bing-Yu Chen National Taiwan University Computer Organization and Structure Bing-Yu Chen National Taiwan University Storage and Other I/O Topics I/O Performance Measures Types and Characteristics of I/O Devices Buses Interfacing I/O Devices

More information

Solros: A Data-Centric Operating System Architecture for Heterogeneous Computing

Solros: A Data-Centric Operating System Architecture for Heterogeneous Computing Solros: A Data-Centric Operating System Architecture for Heterogeneous Computing Changwoo Min, Woonhak Kang, Mohan Kumar, Sanidhya Kashyap, Steffen Maass, Heeseung Jo, Taesoo Kim Virginia Tech, ebay, Georgia

More information

Lecture 23: Storage Systems. Topics: disk access, bus design, evaluation metrics, RAID (Sections )

Lecture 23: Storage Systems. Topics: disk access, bus design, evaluation metrics, RAID (Sections ) Lecture 23: Storage Systems Topics: disk access, bus design, evaluation metrics, RAID (Sections 7.1-7.9) 1 Role of I/O Activities external to the CPU are typically orders of magnitude slower Example: while

More information

Leveraging Burst Buffer Coordination to Prevent I/O Interference

Leveraging Burst Buffer Coordination to Prevent I/O Interference Leveraging Burst Buffer Coordination to Prevent I/O Interference Anthony Kougkas akougkas@hawk.iit.edu Matthieu Dorier, Rob Latham, Rob Ross, Xian-He Sun Wednesday, October 26th Baltimore, USA Outline

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control

More information

Efficient Memory Mapped File I/O for In-Memory File Systems. Jungsik Choi, Jiwon Kim, Hwansoo Han

Efficient Memory Mapped File I/O for In-Memory File Systems. Jungsik Choi, Jiwon Kim, Hwansoo Han Efficient Memory Mapped File I/O for In-Memory File Systems Jungsik Choi, Jiwon Kim, Hwansoo Han Operations Per Second Storage Latency Close to DRAM SATA/SAS Flash SSD (~00μs) PCIe Flash SSD (~60 μs) D-XPoint

More information

Red Fox: An Execution Environment for Relational Query Processing on GPUs

Red Fox: An Execution Environment for Relational Query Processing on GPUs Red Fox: An Execution Environment for Relational Query Processing on GPUs Georgia Institute of Technology: Haicheng Wu, Ifrah Saeed, Sudhakar Yalamanchili LogicBlox Inc.: Daniel Zinn, Martin Bravenboer,

More information

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant

More information

HMM: GUP NO MORE! XDC Jérôme Glisse

HMM: GUP NO MORE! XDC Jérôme Glisse HMM: GUP NO MORE! XDC 2018 Jérôme Glisse HETEROGENEOUS COMPUTING CPU is dead, long live the CPU Heterogeneous computing is back, one device for each workload: GPUs for massively parallel workload Accelerators

More information

Shadowfax: Scaling in Heterogeneous Cluster Systems via GPGPU Assemblies

Shadowfax: Scaling in Heterogeneous Cluster Systems via GPGPU Assemblies Shadowfax: Scaling in Heterogeneous Cluster Systems via GPGPU Assemblies Alexander Merritt, Vishakha Gupta, Abhishek Verma, Ada Gavrilovska, Karsten Schwan {merritt.alex,abhishek.verma}@gatech.edu {vishakha,ada,schwan}@cc.gtaech.edu

More information

FlashShare: Punching Through Server Storage Stack from Kernel to Firmware for Ultra-Low Latency SSDs

FlashShare: Punching Through Server Storage Stack from Kernel to Firmware for Ultra-Low Latency SSDs FlashShare: Punching Through Server Storage Stack from Kernel to Firmware for Ultra-Low Latency SSDs Jie Zhang, Miryeong Kwon, Donghyun Gouk, Sungjoon Koh, Changlim Lee, Mohammad Alian, Myoungjun Chun,

More information

GpuWrapper: A Portable API for Heterogeneous Programming at CGG

GpuWrapper: A Portable API for Heterogeneous Programming at CGG GpuWrapper: A Portable API for Heterogeneous Programming at CGG Victor Arslan, Jean-Yves Blanc, Gina Sitaraman, Marc Tchiboukdjian, Guillaume Thomas-Collignon March 2 nd, 2016 GpuWrapper: Objectives &

More information

Comprehensive Kernel Instrumentation via Dynamic Binary Translation

Comprehensive Kernel Instrumentation via Dynamic Binary Translation Comprehensive Kernel Instrumentation via Dynamic Binary Translation Peter Feiner Angela Demke Brown Ashvin Goel University of Toronto 011 Complexity of Operating Systems 012 Complexity of Operating Systems

More information

CS140 Operating Systems Midterm Review. Feb. 5 th, 2009 Derrick Isaacson

CS140 Operating Systems Midterm Review. Feb. 5 th, 2009 Derrick Isaacson CS140 Operating Systems Midterm Review Feb. 5 th, 2009 Derrick Isaacson Midterm Quiz Tues. Feb. 10 th In class (4:15-5:30 Skilling) Open book, open notes (closed laptop) Bring printouts You won t have

More information

Storage. Hwansoo Han

Storage. Hwansoo Han Storage Hwansoo Han I/O Devices I/O devices can be characterized by Behavior: input, out, storage Partner: human or machine Data rate: bytes/sec, transfers/sec I/O bus connections 2 I/O System Characteristics

More information

Evaluation of Intel Memory Drive Technology Performance for Scientific Applications

Evaluation of Intel Memory Drive Technology Performance for Scientific Applications Evaluation of Intel Memory Drive Technology Performance for Scientific Applications Vladimir Mironov, Andrey Kudryavtsev, Yuri Alexeev, Alexander Moskovsky, Igor Kulikov, and Igor Chernykh Introducing

More information

Flash Memory Summit Persistent Memory - NVDIMMs

Flash Memory Summit Persistent Memory - NVDIMMs Flash Memory Summit 2018 Persistent Memory - NVDIMMs Contents Persistent Memory Overview NVDIMM Conclusions 2 Persistent Memory Memory & Storage Convergence Today Volatile and non-volatile technologies

More information

Crossing the Chasm: Sneaking a parallel file system into Hadoop

Crossing the Chasm: Sneaking a parallel file system into Hadoop Crossing the Chasm: Sneaking a parallel file system into Hadoop Wittawat Tantisiriroj Swapnil Patil, Garth Gibson PARALLEL DATA LABORATORY Carnegie Mellon University In this work Compare and contrast large

More information

Operating Systems (2INC0) 2018/19. Introduction (01) Dr. Tanir Ozcelebi. Courtesy of Prof. Dr. Johan Lukkien. System Architecture and Networking Group

Operating Systems (2INC0) 2018/19. Introduction (01) Dr. Tanir Ozcelebi. Courtesy of Prof. Dr. Johan Lukkien. System Architecture and Networking Group Operating Systems (2INC0) 20/19 Introduction (01) Dr. Courtesy of Prof. Dr. Johan Lukkien System Architecture and Networking Group Course Overview Introduction to operating systems Processes, threads and

More information

THE PROGRAMMER S GUIDE TO THE APU GALAXY. Phil Rogers, Corporate Fellow AMD

THE PROGRAMMER S GUIDE TO THE APU GALAXY. Phil Rogers, Corporate Fellow AMD THE PROGRAMMER S GUIDE TO THE APU GALAXY Phil Rogers, Corporate Fellow AMD THE OPPORTUNITY WE ARE SEIZING Make the unprecedented processing capability of the APU as accessible to programmers as the CPU

More information

Chapter 12: File System Implementation

Chapter 12: File System Implementation Chapter 12: File System Implementation Silberschatz, Galvin and Gagne 2013 Chapter 12: File System Implementation File-System Structure File-System Implementation Allocation Methods Free-Space Management

More information

Cougar Open CL v1.0. Users Guide for Open CL support for Delphi/ C++Builder and.net

Cougar Open CL v1.0. Users Guide for Open CL support for Delphi/ C++Builder and.net Cougar Open CL v1.0 Users Guide for Open CL support for Delphi/ C++Builder and.net MtxVec version v4, rev 2.0 1999-2011 Dew Research www.dewresearch.com Table of Contents Cougar Open CL v1.0... 2 1 About

More information

Fusion Engine Next generation storage engine for Flash- SSD and 3D XPoint storage system

Fusion Engine Next generation storage engine for Flash- SSD and 3D XPoint storage system Fusion Engine Next generation storage engine for Flash- SSD and 3D XPoint storage system Fei Liu, Sheng Qiu, Jianjian Huo, Shu Li Alibaba Group Santa Clara, CA 1 Software overhead become critical Legacy

More information

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013 Lecture 13: Memory Consistency + a Course-So-Far Review Parallel Computer Architecture and Programming Today: what you should know Understand the motivation for relaxed consistency models Understand the

More information

Generic System Calls for GPUs

Generic System Calls for GPUs Generic System Calls for GPUs Ján Veselý*, Arkaprava Basu, Abhishek Bhattacharjee*, Gabriel H. Loh, Mark Oskin, Steven K. Reinhardt *Rutgers University, Indian Institute of Science, Advanced Micro Devices

More information

Distributed caching for cloud computing

Distributed caching for cloud computing Distributed caching for cloud computing Maxime Lorrillere, Julien Sopena, Sébastien Monnet et Pierre Sens February 11, 2013 Maxime Lorrillere (LIP6/UPMC/CNRS) February 11, 2013 1 / 16 Introduction Context

More information

Directed Optimization On Stencil-based Computational Fluid Dynamics Application(s)

Directed Optimization On Stencil-based Computational Fluid Dynamics Application(s) Directed Optimization On Stencil-based Computational Fluid Dynamics Application(s) Islam Harb 08/21/2015 Agenda Motivation Research Challenges Contributions & Approach Results Conclusion Future Work 2

More information

NTRDMA v0.1. An Open Source Driver for PCIe NTB and DMA. Allen Hubbe at Linux Piter 2015 NTRDMA. Messaging App. IB Verbs. dmaengine.h ntb.

NTRDMA v0.1. An Open Source Driver for PCIe NTB and DMA. Allen Hubbe at Linux Piter 2015 NTRDMA. Messaging App. IB Verbs. dmaengine.h ntb. Messaging App IB Verbs NTRDMA dmaengine.h ntb.h DMA DMA DMA NTRDMA v0.1 An Open Source Driver for PCIe and DMA Allen Hubbe at Linux Piter 2015 1 INTRODUCTION Allen Hubbe Senior Software Engineer EMC Corporation

More information

GPUs as better MPI Citizens

GPUs as better MPI Citizens s as better MPI Citizens Author: Dale Southard, NVIDIA Date: 4/6/2011 www.openfabrics.org 1 Technology Conference 2011 October 11-14 San Jose, CA The one event you can t afford to miss Learn about leading-edge

More information

Scalability, Performance & Caching

Scalability, Performance & Caching COMP 150-IDS: Internet Scale Distributed Systems (Spring 2018) Scalability, Performance & Caching Noah Mendelsohn Tufts University Email: noah@cs.tufts.edu Web: http://www.cs.tufts.edu/~noah Copyright

More information

How Next Generation NV Technology Affects Storage Stacks and Architectures

How Next Generation NV Technology Affects Storage Stacks and Architectures How Next Generation NV Technology Affects Storage Stacks and Architectures Marty Czekalski, Interface and Emerging Architecture Program Manager, Seagate Technology Flash Memory Summit 2013 Santa Clara,

More information