Solros: A Data-Centric Operating System Architecture for Heterogeneous Computing

1 Solros: A Data-Centric Operating System Architecture for Heterogeneous Computing
Changwoo Min, Woonhak Kang, Mohan Kumar, Sanidhya Kashyap, Steffen Maass, Heeseung Jo, Taesoo Kim
Virginia Tech, eBay, Georgia Tech, Chonbuk National University
April 26, 2018
Changwoo Min Solros: Data-Centric OS April 26, / 21

4 Cambrian Explosion of Processor Architecture
Specialization of general-purpose processors
Generalization of co-processors
Specialization of co-processors

7 Blazingly fast IO Devices
Blazingly fast storage/memory
Blazingly fast network
How to exploit the full potential of such hardware devices without pain?
  System-wide performance
  Ease of programming

8 Outline
1 Heterogeneous Computing Architectures
2 Solros: Split-Kernel Approach
  Solros Architecture
  Operating System Services
3 Evaluation

13 Host-Centric Approach
Host OS controls co-processors and IO devices
Examples: OpenCL, CUDA
Figure: host processor (OS), I/O device (SSD / NIC), and co-processor, with control and data paths and steps 1 to 3.
Problem
  Redundant data communication
  Complex to program and hard to optimize

17 Coprocessor-Centric Architecture
Co-processors control IO devices
Examples: Xeon Phi (Linux), GPUfs [ASPLOS 13], GPUnet [OSDI 14]
Figure: host processor (OS), I/O device (SSD / NIC), and co-processor (OS), with control and data paths and steps 1 and 2.
Problem
  Significant effort required to port the IO stack to the co-processor
  Does not fully exploit powerful host processors

18 Outline
1 Heterogeneous Computing Architectures
2 Solros: Split-Kernel Approach
  Solros Architecture
  Operating System Services
3 Evaluation

20 Solros
Goal
  Ease of programming
  Best use of processor architecture
  System-wide optimization
Challenge
  Co-processor needs IO abstraction
  IO stack is branch-divergent and difficult to parallelize
  System-wide optimization needs system-wide information

21 Solros Architecture
Split-Kernel Architecture
Data-plane OS
  Runs on a co-processor
  Provides IO abstraction
  Delegates actual IO operations to a control-plane OS
Control-plane OS
  Runs on a host processor
  Runs actual IO stack
  Performs system-wide coordination

26 Solros Architecture
Control-plane OS: actual OS service + system-wide coordination
Data-plane OS: thin communication layer to host processor
Figure: co-processor (OS stub), host processor (OS proxy, policy), and I/O device (SSD / NIC), with control and data paths and steps 1 to 3.
Co-processor has OS abstraction with minimal effort
Best use of each of the fat and lean processors
Efficient global coordination among devices (policy)
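
The stub/proxy split on this slide can be illustrated with a minimal Python sketch. It is not Solros code: the names `DataPlaneStub`, `ControlPlaneProxy`, and `Request` are hypothetical, an in-memory dictionary stands in for the host's real IO stack, and a direct function call stands in for the transport between co-processor and host.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Request:
    """A delegated IO call (hypothetical wire format)."""
    op: str
    args: tuple


class ControlPlaneProxy:
    """Host side: owns the actual IO stack and executes delegated calls."""

    def __init__(self) -> None:
        # Assumption: a dict stands in for the host file system.
        self._files: Dict[str, bytes] = {}
        self._handlers: Dict[str, Callable] = {
            "write": self._write,
            "read": self._read,
        }

    def handle(self, req: Request):
        return self._handlers[req.op](*req.args)

    def _write(self, path: str, data: bytes) -> int:
        self._files[path] = bytes(data)
        return len(data)

    def _read(self, path: str) -> bytes:
        return self._files[path]


class DataPlaneStub:
    """Co-processor side: exposes an IO abstraction, forwards every call."""

    def __init__(self, send: Callable[[Request], object]) -> None:
        # Assumption: `send` models the transport to the host.
        self._send = send

    def write(self, path: str, data: bytes) -> int:
        return self._send(Request("write", (path, data)))

    def read(self, path: str) -> bytes:
        return self._send(Request("read", (path,)))


proxy = ControlPlaneProxy()
stub = DataPlaneStub(proxy.handle)
```

In this sketch the stub holds no IO logic of its own, which mirrors the "thin communication layer" description above; everything stateful lives in the proxy.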

28 Operating System Services
1 Transport service
2 Filesystem service
3 Network service

31 Transport Service
High-performance data transfer among devices is challenging:
  Uniform data transfer among devices
  High contention in massively-parallel co-processor
  Asymmetric performance between host processor and co-processor
Our approach
  Uniform data transfer: system-mapped PCIe window
  High contention: combining, replication, interleaving, etc.
  Asymmetric performance: flexibly configurable (host DMA engine vs. co-processor DMA engine)
See details in the paper
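
The slide names combining as one answer to high contention but leaves the details to the paper. As a generic illustration only, the Python sketch below shows a flat-combining-style queue: producers publish requests, and whichever thread wins the combiner lock applies the whole pending batch to the shared buffer on behalf of the others. The names `CombiningQueue`, `_Op`, and `run_producers` are hypothetical, and a Python list stands in for a shared ring buffer; none of this is taken from the Solros implementation.

```python
import threading
from collections import deque


class _Op:
    """One published enqueue request."""

    def __init__(self, item):
        self.item = item
        self.done = threading.Event()


class CombiningQueue:
    def __init__(self):
        self._pending = deque()                  # requests awaiting a combiner
        self._pending_lock = threading.Lock()    # protects the pending list
        self._combiner_lock = threading.Lock()   # held by the current combiner
        self.ring = []                           # stand-in for the shared buffer
        self.batches = 0                         # number of non-empty batches applied

    def enqueue(self, item):
        op = _Op(item)
        with self._pending_lock:
            self._pending.append(op)
        # Wait until some combiner (possibly this thread) has applied the request.
        while not op.done.is_set():
            if self._combiner_lock.acquire(blocking=False):
                try:
                    self._drain()
                finally:
                    self._combiner_lock.release()
            else:
                op.done.wait(timeout=0.001)

    def _drain(self):
        # Take the whole pending batch, then apply it to the shared buffer.
        with self._pending_lock:
            batch = list(self._pending)
            self._pending.clear()
        if not batch:
            return
        self.batches += 1
        for op in batch:
            self.ring.append(op.item)
            op.done.set()


def run_producers(queue, n_threads=8, per_thread=100):
    """Drive the queue from several producer threads."""
    def work(tid):
        for i in range(per_thread):
            queue.enqueue((tid, i))

    threads = [threading.Thread(target=work, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Only the combiner touches the shared buffer, so contention moves from the buffer itself to a short publish step; how many requests end up in each batch depends on thread timing.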

36 Filesystem Service
Peer-to-peer operation | Buffered operation
Figure: co-processor (file system stub), host processor (file system proxy, file system), PCIe, DMA engine, and SSD, with control and data paths and steps 1 to 5.
Peer-to-peer operation:
  Zero-copy of data between co-processor memory and SSD
  Minimal data transfer

41 Filesystem Service
Peer-to-peer operation | Buffered operation
Figure: co-processor (file system stub), host processor (file system proxy, buffer cache, file system), PCIe, DMA engine, and SSD, with control and data paths and steps 1 to 6.
Buffered operation:
  Reduces storage IO by leveraging a buffer cache shared among co-processors
  Avoids the performance anomaly of peer-to-peer communication over the PCIe bus
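
The difference between the two file system modes can be sketched in a few lines of Python. This is an illustrative model under stated assumptions, not the Solros code: `FileSystemProxy`, `read_p2p`, and `read_buffered` are hypothetical names, dictionaries stand in for the SSD and for co-processor memory, and the caller picks the mode explicitly because the slides do not say how the choice is made.

```python
from typing import Dict


class FileSystemProxy:
    """Host-side proxy that serves block reads for co-processors."""

    def __init__(self, ssd: Dict[int, bytes]) -> None:
        self._ssd = ssd                            # stand-in for the NVMe SSD
        self._buffer_cache: Dict[int, bytes] = {}  # host cache shared by all co-processors
        self.ssd_reads = 0                         # counts storage IO operations

    def _read_ssd(self, block: int) -> bytes:
        self.ssd_reads += 1
        return self._ssd[block]

    def read_p2p(self, block: int, coproc_mem: Dict[int, bytes]) -> None:
        # Peer-to-peer: data goes from the SSD straight into co-processor
        # memory and never enters the host buffer cache.
        coproc_mem[block] = self._read_ssd(block)

    def read_buffered(self, block: int, coproc_mem: Dict[int, bytes]) -> None:
        # Buffered: fill the shared host cache on a miss, then copy from the
        # cache to co-processor memory; later readers skip the SSD entirely.
        if block not in self._buffer_cache:
            self._buffer_cache[block] = self._read_ssd(block)
        coproc_mem[block] = self._buffer_cache[block]
```

With two co-processors reading the same block, the buffered path issues one SSD read in this model while the peer-to-peer path issues two, which is the storage IO reduction the slide attributes to the shared buffer cache.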

42 Implementation
Host: 2-socket Xeon processor (12 cores each)
Co-processor: 4 Xeon Phi (KNC, 61 cores, Linux, PCIe Gen3 x16)
Storage device: 4 NVMe SSDs
NIC: 100 Gbps Ethernet
Lines of code:
Module                       Added lines   Deleted lines
Transport service            1,
File system service, stub    5,957         2,073
File system service, proxy   2,
Network service, stub        2,
Network service, proxy       5,
NVMe device driver
SCIF kernel module
Total                        18,844        2,714

43 Evaluation
Questions:
  Performance of Solros services
  Impact on real-world applications

44 Performance of Solros Services
Figure (a): file random read on SSD; GB/sec for Phi-Solros and Phi-Linux, axis labels from 32KB to 2MB.
Figure (b): TCP latency, 64-byte message; percentage of requests (%) against latency (usec) for Phi-Solros and Phi-Linux.
File IO performance: 19x faster than the stock Linux on Xeon Phi
TCP latency (99th percentile): 7x shorter than the stock Linux on Xeon Phi

45 Performance of Solros Services
Figure (a): file random read on SSD; breakdown into file system, block/transport, and storage for Phi-Linux and Phi-Solros.
Figure (b): TCP latency, 64-byte message; breakdown into network stack and proxy/transport for Phi-Linux and Phi-Solros.
Axis: latency (msec)
Significant performance gain in data transport
Running IO stack on co-processor is slower

46 Real-world - Image Search
Image search engine is running on Xeon Phi
Image database is on NVMe SSD (shared read-only)
Image search queries are from network
Figure panels: elapsed time (sec); bandwidth (MB/s) against time (sec). Series: Phi-Solros, Phi-Linux.
Solros performs 2x faster than stock Linux on Xeon Phi

47 Conclusion
Solros, a new operating system architecture for co-processors and fast IO devices
Control-plane, data-plane architecture allows:
  Supporting high-level OS abstraction on co-processor
  Efficient global coordination among devices
  Near-ideal IO performance from co-processor
We will release source code soon

48 Image Search - Scalability
Increase the number of Xeon Phi to 4
Figure: normalized speedup against the number of Xeon Phi co-processors.
Solros load balancing mechanism achieves near-linear scaling

49 Related work
Control-plane/data-plane OS: Arrakis [OSDI 14], IX [OSDI 14]
OS for heterogeneous systems: Helios [SOSP 09], M3 [ASPLOS 16], Hydra [ASPLOS 08]
IO support for GPU: PTask [SOSP 11], GPUfs [ASPLOS 13], GPUnet [OSDI 14]

50 Transport service
Figure: three address spaces. The host's virtual address space holds App 1 and App 2. The host's physical address space holds host RAM and the PCIe window. The device's physical address space holds mmio and DRAM regions for Xeon Phi, NIC, and NVMe. Labels on the mappings: "mapped to (in-host)" and "mapped to (across devices)".

51 Network Service (TCP)
Outbound operation | Inbound operation
Figure: co-processor (TCP stub), host processor (TCP proxy, load balancer, TCP stack), PCIe, and NIC, with control and data paths and steps 1 and 2.
A load balancer on the host distributes incoming TCP connections to one of the least-loaded co-processors.
See details in the paper.
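
The least-loaded placement described on this slide can be sketched as follows. The slides do not define how load is measured, so this sketch assumes load means the number of active connections per co-processor; `LeastLoadedBalancer`, `accept`, and `close` are hypothetical names, not Solros interfaces.

```python
from typing import Dict, List


class LeastLoadedBalancer:
    """Assigns each incoming connection to the co-processor with the fewest active ones."""

    def __init__(self, coprocessors: List[str]) -> None:
        # Assumption: load is the count of active connections per co-processor.
        self.active: Dict[str, int] = {name: 0 for name in coprocessors}

    def accept(self) -> str:
        # Ties go to the co-processor listed first.
        target = min(self.active, key=self.active.get)
        self.active[target] += 1
        return target

    def close(self, name: str) -> None:
        # A finished connection frees capacity on its co-processor.
        if self.active[name] > 0:
            self.active[name] -= 1
```

A real balancer would likely use richer load signals than a connection count; the point here is only that the decision is made on the host, where system-wide information is available.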

52 Discussion
Hardware support other than Xeon Phi
  Two atomic instructions: transport service
  MMU: isolation among co-processor applications
Scalability of control-plane OS
  Limited by scalability of OS service, PCIe interconnect, and performance of IO devices

53 Real-world - Text Search
CLucene text indexing engine running on Xeon Phi
Text data is on NVMe SSD
Figure panels: elapsed time (sec); bandwidth (GB/s) against time (sec). Series: Phi-Solros, Phi-Linux (virtio).
Solros performs 19x faster than stock Linux (ext4/virtio) on Xeon Phi

Mosaic: Processing a Trillion-Edge Graph on a Single Machine

Mosaic: Processing a Trillion-Edge Graph on a Single Machine Mosaic: Processing a Trillion-Edge Graph on a Single Machine Steffen Maass, Changwoo Min, Sanidhya Kashyap, Woonhak Kang, Mohan Kumar, Taesoo Kim Georgia Institute of Technology Best Student Paper @ EuroSys

More information

Scaling Internet TV Content Delivery ALEX GUTARIN DIRECTOR OF ENGINEERING, NETFLIX

Scaling Internet TV Content Delivery ALEX GUTARIN DIRECTOR OF ENGINEERING, NETFLIX Scaling Internet TV Content Delivery ALEX GUTARIN DIRECTOR OF ENGINEERING, NETFLIX Inventing Internet TV Available in more than 190 countries 104+ million subscribers Lots of Streaming == Lots of Traffic

More information

RDMA and Hardware Support

RDMA and Hardware Support RDMA and Hardware Support SIGCOMM Topic Preview 2018 Yibo Zhu Microsoft Research 1 The (Traditional) Journey of Data How app developers see the network Under the hood This architecture had been working

More information

A Disseminated Distributed OS for Hardware Resource Disaggregation Yizhou Shan

A Disseminated Distributed OS for Hardware Resource Disaggregation Yizhou Shan LegoOS A Disseminated Distributed OS for Hardware Resource Disaggregation Yizhou Shan, Yutong Huang, Yilun Chen, and Yiying Zhang Y 4 1 2 Monolithic Server OS / Hypervisor 3 Problems? 4 cpu mem Resource

More information

Accelerator-centric operating systems

Accelerator-centric operating systems Accelerator-centric operating systems Rethinking the role of s in modern computers Mark Silberstein EE, Technion System design challenge: Programmability and Performance 2 System design challenge: Programmability

More information

Messaging Overview. Introduction. Gen-Z Messaging

Messaging Overview. Introduction. Gen-Z Messaging Page 1 of 6 Messaging Overview Introduction Gen-Z is a new data access technology that not only enhances memory and data storage solutions, but also provides a framework for both optimized and traditional

More information

PASTE: A Network Programming Interface for Non-Volatile Main Memory

PASTE: A Network Programming Interface for Non-Volatile Main Memory PASTE: A Network Programming Interface for Non-Volatile Main Memory Michio Honda (NEC Laboratories Europe) Giuseppe Lettieri (Università di Pisa) Lars Eggert and Douglas Santry (NetApp) USENIX NSDI 2018

More information

Re-architecting Virtualization in Heterogeneous Multicore Systems

Re-architecting Virtualization in Heterogeneous Multicore Systems Re-architecting Virtualization in Heterogeneous Multicore Systems Himanshu Raj, Sanjay Kumar, Vishakha Gupta, Gregory Diamos, Nawaf Alamoosa, Ada Gavrilovska, Karsten Schwan, Sudhakar Yalamanchili College

More information

Fast packet processing in the cloud. Dániel Géhberger Ericsson Research

Fast packet processing in the cloud. Dániel Géhberger Ericsson Research Fast packet processing in the cloud Dániel Géhberger Ericsson Research Outline Motivation Service chains Hardware related topics, acceleration Virtualization basics Software performance and acceleration

More information

PASTE: A Networking API for Non-Volatile Main Memory

PASTE: A Networking API for Non-Volatile Main Memory PASTE: A Networking API for Non-Volatile Main Memory Michio Honda (NEC Laboratories Europe) Lars Eggert (NetApp) Douglas Santry (NetApp) TSVAREA@IETF 99, Prague May 22th 2017 More details at our HotNets

More information

Important new NVMe features for optimizing the data pipeline

Important new NVMe features for optimizing the data pipeline Important new NVMe features for optimizing the data pipeline Dr. Stephen Bates, CTO Eideticom Santa Clara, CA 1 Outline Intro to NVMe Controller Memory Buffers (CMBs) Use cases for CMBs Submission Queue

More information

Farewell to Servers: Hardware, Software, and Network Approaches towards Datacenter Resource Disaggregation

Farewell to Servers: Hardware, Software, and Network Approaches towards Datacenter Resource Disaggregation Farewell to Servers: Hardware, Software, and Network Approaches towards Datacenter Resource Disaggregation Yiying Zhang Datacenter 3 Monolithic Computer OS / Hypervisor 4 Can monolithic Application Hardware

More information

Maximizing heterogeneous system performance with ARM interconnect and CCIX

Maximizing heterogeneous system performance with ARM interconnect and CCIX Maximizing heterogeneous system performance with ARM interconnect and CCIX Neil Parris, Director of product marketing Systems and software group, ARM Teratec June 2017 Intelligent flexible cloud to enable

More information

Efficient Memory Mapped File I/O for In-Memory File Systems. Jungsik Choi, Jiwon Kim, Hwansoo Han

Efficient Memory Mapped File I/O for In-Memory File Systems. Jungsik Choi, Jiwon Kim, Hwansoo Han Efficient Memory Mapped File I/O for In-Memory File Systems Jungsik Choi, Jiwon Kim, Hwansoo Han Operations Per Second Storage Latency Close to DRAM SATA/SAS Flash SSD (~00μs) PCIe Flash SSD (~60 μs) D-XPoint

More information

Accelerating Data Centers Using NVMe and CUDA

Accelerating Data Centers Using NVMe and CUDA Accelerating Data Centers Using NVMe and CUDA Stephen Bates, PhD Technical Director, CSTO, PMC-Sierra Santa Clara, CA 1 Project Donard @ PMC-Sierra Donard is a PMC CTO project that leverages NVM Express

More information

Overcoming the Memory System Challenge in Dataflow Processing. Darren Jones, Wave Computing Drew Wingard, Sonics

Overcoming the Memory System Challenge in Dataflow Processing. Darren Jones, Wave Computing Drew Wingard, Sonics Overcoming the Memory System Challenge in Dataflow Processing Darren Jones, Wave Computing Drew Wingard, Sonics Current Technology Limits Deep Learning Performance Deep Learning Dataflow Graph Existing

More information

Be Fast, Cheap and in Control with SwitchKV. Xiaozhou Li

Be Fast, Cheap and in Control with SwitchKV. Xiaozhou Li Be Fast, Cheap and in Control with SwitchKV Xiaozhou Li Goal: fast and cost-efficient key-value store Store, retrieve, manage key-value objects Get(key)/Put(key,value)/Delete(key) Target: cluster-level

More information

Farewell to Servers: Resource Disaggregation

Farewell to Servers: Resource Disaggregation Farewell to Servers: Hardware, Software, and Network Approaches towards Datacenter Resource Disaggregation Yiying Zhang 2 Monolithic Computer OS / Hypervisor 3 Can monolithic Application Hardware servers

More information

Intel Xeon Phi Coprocessors

Intel Xeon Phi Coprocessors Intel Xeon Phi Coprocessors Reference: Parallel Programming and Optimization with Intel Xeon Phi Coprocessors, by A. Vladimirov and V. Karpusenko, 2013 Ring Bus on Intel Xeon Phi Example with 8 cores Xeon

More information

Managing Array of SSDs When the Storage Device is No Longer the Performance Bottleneck

Managing Array of SSDs When the Storage Device is No Longer the Performance Bottleneck Managing Array of Ds When the torage Device is No Longer the Performance Bottleneck Byung. Kim, Jaeho Kim, am H. Noh UNIT (Ulsan National Institute of cience & Technology) Outline Motivation & Observation

More information

Memory-Based Cloud Architectures

Memory-Based Cloud Architectures Memory-Based Cloud Architectures ( Or: Technical Challenges for OnDemand Business Software) Jan Schaffner Enterprise Platform and Integration Concepts Group Example: Enterprise Benchmarking -) *%'+,#$)

More information

Mellanox Technologies Maximize Cluster Performance and Productivity. Gilad Shainer, October, 2007

Mellanox Technologies Maximize Cluster Performance and Productivity. Gilad Shainer, October, 2007 Mellanox Technologies Maximize Cluster Performance and Productivity Gilad Shainer, shainer@mellanox.com October, 27 Mellanox Technologies Hardware OEMs Servers And Blades Applications End-Users Enterprise

More information

Application Acceleration Beyond Flash Storage

Application Acceleration Beyond Flash Storage Application Acceleration Beyond Flash Storage Session 303C Mellanox Technologies Flash Memory Summit July 2014 Accelerating Applications, Step-by-Step First Steps Make compute fast Moore s Law Make storage

More information

Generic System Calls for GPUs

Generic System Calls for GPUs Generic System Calls for GPUs Ján Veselý*, Arkaprava Basu, Abhishek Bhattacharjee*, Gabriel H. Loh, Mark Oskin, Steven K. Reinhardt *Rutgers University, Indian Institute of Science, Advanced Micro Devices

More information

EXTENDING AN ASYNCHRONOUS MESSAGING LIBRARY USING AN RDMA-ENABLED INTERCONNECT. Konstantinos Alexopoulos ECE NTUA CSLab

EXTENDING AN ASYNCHRONOUS MESSAGING LIBRARY USING AN RDMA-ENABLED INTERCONNECT. Konstantinos Alexopoulos ECE NTUA CSLab EXTENDING AN ASYNCHRONOUS MESSAGING LIBRARY USING AN RDMA-ENABLED INTERCONNECT Konstantinos Alexopoulos ECE NTUA CSLab MOTIVATION HPC, Multi-node & Heterogeneous Systems Communication with low latency

More information

Moneta: A High-Performance Storage Architecture for Next-generation, Non-volatile Memories

Moneta: A High-Performance Storage Architecture for Next-generation, Non-volatile Memories Moneta: A High-Performance Storage Architecture for Next-generation, Non-volatile Memories Adrian M. Caulfield Arup De, Joel Coburn, Todor I. Mollov, Rajesh K. Gupta, Steven Swanson Non-Volatile Systems

More information

How Might Recently Formed System Interconnect Consortia Affect PM? Doug Voigt, SNIA TC

How Might Recently Formed System Interconnect Consortia Affect PM? Doug Voigt, SNIA TC How Might Recently Formed System Interconnect Consortia Affect PM? Doug Voigt, SNIA TC Three Consortia Formed in Oct 2016 Gen-Z Open CAPI CCIX complex to rack scale memory fabric Cache coherent accelerator

More information

I/O Systems. Jo, Heeseung

I/O Systems. Jo, Heeseung I/O Systems Jo, Heeseung Today's Topics Device characteristics Block device vs. Character device Direct I/O vs. Memory-mapped I/O Polling vs. Interrupts Programmed I/O vs. DMA Blocking vs. Non-blocking

More information

Martin Dubois, ing. Contents

Martin Dubois, ing. Contents Martin Dubois, ing Contents Without OpenNet vs With OpenNet Technical information Possible applications Artificial Intelligence Deep Packet Inspection Image and Video processing Network equipment development

More information

Exploiting InfiniBand and GPUDirect Technology for High Performance Collectives on GPU Clusters

Exploiting InfiniBand and GPUDirect Technology for High Performance Collectives on GPU Clusters Exploiting InfiniBand and Direct Technology for High Performance Collectives on Clusters Ching-Hsiang Chu chu.368@osu.edu Department of Computer Science and Engineering The Ohio State University OSU Booth

More information

SoftFlash: Programmable Storage in Future Data Centers Jae Do Researcher, Microsoft Research

SoftFlash: Programmable Storage in Future Data Centers Jae Do Researcher, Microsoft Research SoftFlash: Programmable Storage in Future Data Centers Jae Do Researcher, Microsoft Research 1 The world s most valuable resource Data is everywhere! May. 2017 Values from Data! Need infrastructures for

More information

Understanding Manycore Scalability of File Systems. Changwoo Min, Sanidhya Kashyap, Steffen Maass Woonhak Kang, and Taesoo Kim

Understanding Manycore Scalability of File Systems. Changwoo Min, Sanidhya Kashyap, Steffen Maass Woonhak Kang, and Taesoo Kim Understanding Manycore Scalability of File Systems Changwoo Min, Sanidhya Kashyap, Steffen Maass Woonhak Kang, and Taesoo Kim Application must parallelize I/O operations Death of single core CPU scaling

More information

Extremely Fast Distributed Storage for Cloud Service Providers

Extremely Fast Distributed Storage for Cloud Service Providers Solution brief Intel Storage Builders StorPool Storage Intel SSD DC S3510 Series Intel Xeon Processor E3 and E5 Families Intel Ethernet Converged Network Adapter X710 Family Extremely Fast Distributed

More information

GPUfs: Integrating a file system with GPUs

GPUfs: Integrating a file system with GPUs ASPLOS 2013 GPUfs: Integrating a file system with GPUs Mark Silberstein (UT Austin/Technion) Bryan Ford (Yale), Idit Keidar (Technion) Emmett Witchel (UT Austin) 1 Traditional System Architecture Applications

More information

REMOTE PERSISTENT MEMORY ACCESS WORKLOAD SCENARIOS AND RDMA SEMANTICS

REMOTE PERSISTENT MEMORY ACCESS WORKLOAD SCENARIOS AND RDMA SEMANTICS 13th ANNUAL WORKSHOP 2017 REMOTE PERSISTENT MEMORY ACCESS WORKLOAD SCENARIOS AND RDMA SEMANTICS Tom Talpey Microsoft [ March 31, 2017 ] OUTLINE Windows Persistent Memory Support A brief summary, for better

More information

Mapping MPI+X Applications to Multi-GPU Architectures

Mapping MPI+X Applications to Multi-GPU Architectures Mapping MPI+X Applications to Multi-GPU Architectures A Performance-Portable Approach Edgar A. León Computer Scientist San Jose, CA March 28, 2018 GPU Technology Conference This work was performed under

More information

Falcon: Scaling IO Performance in Multi-SSD Volumes. The George Washington University

Falcon: Scaling IO Performance in Multi-SSD Volumes. The George Washington University Falcon: Scaling IO Performance in Multi-SSD Volumes Pradeep Kumar H Howie Huang The George Washington University SSDs in Big Data Applications Recent trends advocate using many SSDs for higher throughput

More information

NFS/RDMA over 40Gbps iwarp Wael Noureddine Chelsio Communications

NFS/RDMA over 40Gbps iwarp Wael Noureddine Chelsio Communications NFS/RDMA over 40Gbps iwarp Wael Noureddine Chelsio Communications Outline RDMA Motivating trends iwarp NFS over RDMA Overview Chelsio T5 support Performance results 2 Adoption Rate of 40GbE Source: Crehan

More information

An NVMe-based Offload Engine for Storage Acceleration Sean Gibb, Eideticom Stephen Bates, Raithlin

An NVMe-based Offload Engine for Storage Acceleration Sean Gibb, Eideticom Stephen Bates, Raithlin An NVMe-based Offload Engine for Storage Acceleration Sean Gibb, Eideticom Stephen Bates, Raithlin 1 Overview Acceleration for Storage NVMe for Acceleration How are we using (abusing ;-)) NVMe to support

More information

Learn Your Alphabet - SRIOV, NPIV, RoCE, iwarp to Pump Up Virtual Infrastructure Performance

Learn Your Alphabet - SRIOV, NPIV, RoCE, iwarp to Pump Up Virtual Infrastructure Performance Learn Your Alphabet - SRIOV, NPIV, RoCE, iwarp to Pump Up Virtual Infrastructure Performance TechTarget Dennis Martin 1 Agenda About Demartek I/O Virtualization Concepts RDMA Concepts Examples Demartek

More information

GPUfs: Integrating a file system with GPUs

GPUfs: Integrating a file system with GPUs GPUfs: Integrating a file system with GPUs Mark Silberstein (UT Austin/Technion) Bryan Ford (Yale), Idit Keidar (Technion) Emmett Witchel (UT Austin) 1 Traditional System Architecture Applications OS CPU

More information

Designing Optimized MPI Broadcast and Allreduce for Many Integrated Core (MIC) InfiniBand Clusters

Designing Optimized MPI Broadcast and Allreduce for Many Integrated Core (MIC) InfiniBand Clusters Designing Optimized MPI Broadcast and Allreduce for Many Integrated Core (MIC) InfiniBand Clusters K. Kandalla, A. Venkatesh, K. Hamidouche, S. Potluri, D. Bureddy and D. K. Panda Presented by Dr. Xiaoyi

More information

DPDK Summit China 2017

DPDK Summit China 2017 Summit China 2017 Embedded Network Architecture Optimization Based on Lin Hao T1 Networks Agenda Our History What is an embedded network device Challenge to us Requirements for device today Our solution

More information

IX: A Protected Dataplane Operating System for High Throughput and Low Latency

IX: A Protected Dataplane Operating System for High Throughput and Low Latency IX: A Protected Dataplane Operating System for High Throughput and Low Latency Adam Belay et al. Proc. of the 11th USENIX Symp. on OSDI, pp. 49-65, 2014. Presented by Han Zhang & Zaina Hamid Challenges

More information

Eleos: Exit-Less OS Services for SGX Enclaves

Eleos: Exit-Less OS Services for SGX Enclaves Eleos: Exit-Less OS Services for SGX Enclaves Meni Orenbach Marina Minkin Pavel Lifshits Mark Silberstein Accelerated Computing Systems Lab Haifa, Israel What do we do? Improve performance: I/O intensive

More information

Impact of Cache Coherence Protocols on the Processing of Network Traffic

Amit Kumar and Ram Huggahalli Communication Technology Lab Corporate Technology Group Intel Corporation 12/3/2007 Outline Background

Computer Systems Laboratory Sungkyunkwan University

I/O System Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Introduction (1) I/O devices can be characterized by Behavior: input, output, storage

Near-Data Computation: It's Not (Just) About Performance

Steven Swanson Non-Volatile Systems Laboratory Computer Science and Engineering University of California, San Diego 1 Solid State Memories NAND

This presentation provides an overview of Gen Z architecture and its application in multiple use cases.

Despite numerous advances in data storage and computation, data access complexity continues to

IBM Power AC922 Server

The Best Server for Enterprise AI Highlights More accuracy - GPUs access system RAM for larger models Faster insights - significant deep learning speedups Rapid deployment - integrated

Parallel Computing: Parallel Architectures Jin, Hai

School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer

IsoStack Highly Efficient Network Processing on Dedicated Cores

Leah Shalev Eran Borovik, Julian Satran, Muli Ben-Yehuda Outline Motivation IsoStack architecture Prototype TCP/IP over 10GE on a single

Designing Next Generation FS for NVMe and NVMe-oF

Liran Zvibel CTO, Co-founder Weka.IO @liranzvibel Santa Clara, CA

Understanding and Improving the Cost of Scaling Distributed Event Processing

Shoaib Akram, Manolis Marazakis, and Angelos Bilas shbakram@ics.forth.gr Foundation for Research and Technology Hellas (FORTH)

IBM Power Advanced Compute (AC) AC922 Server

The Best Server for Enterprise AI Highlights IBM Power Systems Accelerated Compute (AC922) server is an acceleration superhighway to enterprise-class AI. A

Extending the NVMHCI Standard to Enterprise

Amber Huffman Principal Engineer Intel Corporation August 2009 1 Outline Remember: What is NVMHCI PCIe SSDs Coming with Challenges Enterprise Extensions to NVMHCI

To Infiniband or Not Infiniband, One Site's Perspective. Steve Woods MCNC

Agenda Infiniband background Current configuration Base Performance Application performance experience Future Conclusions

Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins

Outline History & Motivation Architecture Core architecture Network Topology Memory hierarchy Brief comparison to GPU & Tilera Programming Applications

Designing a True Direct-Access File System with DevFS

Sudarsun Kannan, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau University of Wisconsin-Madison Yuangang Wang, Jun Xu, Gopinath Palani Huawei Technologies

PORTING CP2K TO THE INTEL XEON PHI. ARCHER Technical Forum, Wed 30th July Iain Bethune

Iain Bethune (ibethune@epcc.ed.ac.uk) Outline Xeon Phi Overview Porting CP2K to Xeon Phi Performance Results Lessons Learned Further

Developing deterministic networking technology for railway applications using TTEthernet software-based end systems

Project n 100021 Astrit Ademaj, TTTech Computertechnik AG Outline GENESYS requirements

GPUnet: Networking Abstractions for GPU Programs. Author: Andrzej Jackowski

GPU programming problem GPU distributed application flow 1. recv req Network 4. send repl 2. exec on GPU CPU & Memory 3. get results GPU & Memory

Flexible Architecture Research Machine (FARM)

RAMP Retreat June 25, 2009 Jared Casper, Tayo Oguntebi, Sungpack Hong, Nathan Bronson Christos Kozyrakis, Kunle Olukotun Motivation Why CPUs + FPGAs make sense

GPUs and Emerging Architectures

Mike Giles mike.giles@maths.ox.ac.uk Mathematical Institute, Oxford University e-infrastructure South Consortium Oxford e-research Centre CPUs

EXPERIENCES WITH NVME OVER FABRICS

13th ANNUAL WORKSHOP 2017 Parav Pandit, Oren Duer, Max Gurtovoy Mellanox Technologies [ 31 March, 2017 ] BACKGROUND: NVME TECHNOLOGY Optimized for flash and next-gen

ETHOS A Generic Ethernet over Sockets Driver for Linux

Parallel and Distributed Computing and Systems Rainer Finocchiaro Tuesday November 18 2008 CHAIR FOR OPERATING SYSTEMS Outline Motivation Architecture of

p2pmem: Enabling PCIe Peer-2-Peer in Linux Stephen Bates, PhD Raithlin Consulting

Nomenclature: A Reminder PCIe Peer-2-Peer using p2pmem is NOT blucky!! 2 The Rationale: A Reminder 3 The Rationale: A

Database Architecture 2 & Storage. Instructor: Matei Zaharia cs245.stanford.edu

Summary from Last Time System R mostly matched the architecture of a modern RDBMS » SQL » Many storage & access methods » Cost-based

LITE Kernel RDMA. Support for Datacenter Applications. Shin-Yeh Tsai, Yiying Zhang

Time 2 Berkeley Socket Userspace Kernel Hardware Time 1983 2 Berkeley Socket TCP Offload engine Arrakis & mtcp IX Userspace

The rCUDA middleware and applications

Will my application work with rCUDA? rCUDA currently provides binary compatibility with CUDA 5.0, virtualizing the entire Runtime API except for the graphics functions,

Unblinding the OS to Optimize User-Perceived Flash SSD Latency

Woong Shin *, Jaehyun Park **, Heon Y. Yeom * * Seoul National University ** Arizona State University USENIX HotStorage 2016 Jun. 21, 2016

Broadberry. Artificial Intelligence Server for Fraud. Date: Q Application: Artificial Intelligence

Artificial Intelligence Server for Fraud Date: Q2 2017 Application: Artificial Intelligence Tags: Artificial intelligence, GPU, GTX 1080 TI HM Revenue & Customs The UK's tax, payments and customs authority

CSC501 Operating Systems Principles. OS Structure

Announcements TA's office hour has changed: Thursday 1:30pm-3:00pm, MRC-409C, or email: awang@ncsu.edu From department: No audit allowed 2 Last

No Tradeoff Low Latency + High Efficiency

Christos Kozyrakis http://mast.stanford.edu Latency-critical Applications A growing class of online workloads Search, social networking, software-as-service (SaaS),

Software and Tools for HPE's The Machine Project

Scalable Tools Workshop Aug/1 - Aug/4, 2016 Lake Tahoe Milind Chabbi Traditional Computing Paradigm CPU DRAM CPU DRAM CPU-centric computing 2 CPU-Centric

WormTerminator: An Effective Containment of Unknown and Polymorphic Fast Spreading Worms

Songqing Chen, Xinyuan Wang, Lei Liu George Mason University, VA Xinwen Zhang Samsung Computer Science Lab, CA

SmartNICs: Giving Rise To Smarter Offload at The Edge and In The Data Center

Jeff Defilippi Senior Product Manager Arm #Arm Tech Symposia The Cloud to Edge Infrastructure Foundation for a World of 1T Intelligent

General Purpose GPU Programming (1) Advanced Operating Systems Lecture 14

Lecture Outline Heterogenous multi-core systems and general purpose GPU programming Programming models Heterogenous multi-kernels

GPUfs: Integrating a file system with GPUs

Mark Silberstein (UT Austin/Technion) Bryan Ford (Yale), Idit Keidar (Technion) Emmett Witchel (UT Austin) 1 Building systems with GPUs is hard. Why? 2 Goal of

Chelsio Communications. Meeting Today's Datacenter Challenges. Produced by Tabor Custom Publishing in conjunction with: CUSTOM PUBLISHING

Introduction In this era of Big Data, today's HPC systems are faced with unprecedented growth in the complexity

Pocket: Elastic Ephemeral Storage for Serverless Analytics

Ana Klimovic*, Yawen Wang*, Patrick Stuedi+, Animesh Trivedi+, Jonas Pfefferle+, Christos Kozyrakis* *Stanford University, +IBM Research

M³ Microkernel-based System for Heterogeneous Manycores

Nils Asmussen MKC, 06/29/2017 Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation Heterogeneous

High-Performance Transaction Processing in Journaling File Systems Y. Son, S. Kim, H. Y. Yeom, and H. Han

Seoul National University, Korea Dongduk Women's University, Korea Contents Motivation and Background

Data Processing at the Speed of 100 Gbps using Apache Crail. Patrick Stuedi IBM Research

The CRAIL Project: Overview Data Processing Framework (e.g., Spark, TensorFlow, λ Compute) Spark-IO FS Albis Streaming

An FPGA-Based Optical IOH Architecture for Embedded System

Saravana.S Assistant Professor, Bharath University, Chennai 600073, India Abstract Data traffic has tremendously increased and is still increasing

The Why and How of Developing All-Flash Storage Server

June 2016 Jungsoo Kim Manager, SK Telecom Agenda Why we care about All-Flash Storage Transforming to 5G Network Open HW & SW Projects @ SKT Our approaches

GPUs have enormous power that is enormously difficult to use

Nvidia GP100 - 5.3 TFlops of double precision This is equivalent to the fastest super computer in the world in 2001; put a single rack

Advanced Memory Organizations

CSE 3421: Introduction to Computer Architecture Study: 5.1, 5.2, 5.3, 5.4 (only parts) Gojko Babić 03-29-2018 1 Growth in Performance of DRAM & CPU Huge mismatch between CPU

Moneta: A High-performance Storage Array Architecture for Next-generation, Non-volatile Memories, Micro 2010

NVM-based SSD NVMs are replacing spinning-disks Performance of disks has lagged NAND flash showed

The SNIA NVM Programming Model. #OFADevWorkshop

Opportunities with Next Generation NVM NVMe & STA SNIA 2 NVM Express/SCSI Express: Optimized storage interconnect & driver SNIA NVM Programming TWG: Optimized

Open Fabrics Workshop 2013

OFS Software for the Intel Xeon Phi Bob Woodruff Agenda Intel Coprocessor Communication Link (CCL) Software IBSCIF RDMA from Host to Intel Xeon Phi Direct HCA Access from Intel

Initial Evaluation of a User-Level Device Driver Framework

Stefan Götz Karlsruhe University Germany sgoetz@ira.uka.de Kevin Elphinstone National ICT Australia University of New South Wales kevine@cse.unsw.edu.au

Finite Element Integration and Assembly on Modern Multi and Many-core Processors

Krzysztof Banaś, Jan Bielański, Kazimierz Chłoń AGH University of Science and Technology, Mickiewicza 30, 30-059 Kraków,

Application Access to Persistent Memory The State of the Nation(s)!

Stephen Bates, Paul Grun, Tom Talpey, Doug Voigt Microsemi, Cray, Microsoft, HPE The Suspects Stephen Bates Microsemi Paul Grun Cray

Tile Processor (TILEPro64)

Tile Processor Case Study of Contemporary Multicore Fall 2010 Agarwal 6.173 Performance # of cores On-chip cache (MB) Cache coherency Operations (16/32-bit BOPS) On chip bandwidth

2nd Half. Memory management Disk management Network and Security Virtual machine

Final Review Abstraction Virtual Memory (VM) 4GB (32bit) linear address space for each process Reality 1GB of actual

Xen Network I/O Performance Analysis and Opportunities for Improvement

J. Renato Santos G. (John) Janakiraman Yoshio Turner HP Labs Xen Summit April 17-18, 27 23 Hewlett-Packard Development Company, L.P.

Onyx: A Prototype Phase-Change Memory Storage Array

Ameen Akel * Adrian Caulfield, Todor Mollov, Rajesh Gupta, Steven Swanson Non-Volatile Systems Laboratory, Department of Computer Science and Engineering

When MPPDB Meets GPU:

An Extendible Framework for Acceleration Laura Chen, Le Cai, Yongyan Wang Background: Heterogeneous Computing Hardware Trend stops growing with Moore's Law Fast development of GPU

Application Advantages of NVMe over Fabrics RDMA and Fibre Channel

Brandon Hoff Broadcom Limited Tuesday, June 14 2016 10:55-11:35 a.m. Agenda Applications that have a need for speed The Benefits of