Solros: A Data-Centric Operating System Architecture for Heterogeneous Computing

1 Solros: A Data-Centric Operating System Architecture for Heterogeneous Computing
Changwoo Min, Woonhak Kang, Mohan Kumar, Sanidhya Kashyap, Steffen Maass, Heeseung Jo, Taesoo Kim
Virginia Tech, eBay, Georgia Tech, Chonbuk National University
April 26, 2018
Changwoo Min Solros: Data-Centric OS April 26, / 21

4 Cambrian Explosion of Processor Architecture
- Specialization of general-purpose processors
- Generalization of co-processors
- Specialization of co-processors

7 Blazingly fast IO Devices
- Blazingly fast storage/memory
- Blazingly fast network
How to exploit the full potential of such hardware devices without pain?
- System-wide performance
- Ease of programming

8 Outline
1 Heterogeneous Computing Architectures
2 Solros: Split-Kernel Approach
  Solros Architecture
  Operating System Services
3 Evaluation

13 Host-Centric Approach
- Host OS controls co-processors and IO devices
- Examples: OpenCL, CUDA
[Figure: host processor (OS), I/O device (SSD / NIC), and co-processor, connected by control and data paths, with steps numbered 1 to 3]
Problem
- Redundant data communication
- Complex to program and hard to optimize

17 Coprocessor-Centric Architecture
- Co-processors control IO devices
- Examples: Xeon Phi (Linux), GPUfs [ASPLOS 13], GPUnet [OSDI 14]
[Figure: host processor (OS), I/O device (SSD / NIC), and co-processor (OS), connected by control and data paths, with steps numbered 1 to 2]
Problem
- Significant effort required for porting IO stack to co-processor
- Not completely exploiting powerful host processors

18 Outline
1 Heterogeneous Computing Architectures
2 Solros: Split-Kernel Approach
  Solros Architecture
  Operating System Services
3 Evaluation

20 Solros Goal
- Ease of programming
- Best use of processor architecture
- System-wide optimization
Challenge
- Co-processor needs IO abstraction
- IO stack is branch-divergent and difficult to parallelize
- It needs system-wide information

21 Solros Architecture
Split-Kernel Architecture
Data-plane OS
- Runs on a co-processor
- Provides IO abstraction
- Delegates actual IO operations to a control-plane OS
Control-plane OS
- Runs on a host processor
- Runs actual IO stack
- Performs system-wide coordination

26 Solros Architecture
- Control-plane OS: actual OS service + system-wide coordination
- Data-plane OS: thin communication layer to host processor
[Figure: co-processor (OS stub), host processor (OS proxy, policy), and I/O device (SSD / NIC), connected by control and data paths, with steps numbered 1 to 3]
- Co-processor has OS abstraction with minimal effort
- Best use of each of the fat and lean processors
- Efficient global coordination among devices (policy)
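
The split between a thin data-plane stub and a control-plane proxy can be sketched as follows. This is a minimal single-process illustration, not the Solros implementation: the class names `ControlPlaneProxy` and `DataPlaneStub`, the request dictionary format, the file path, and the in-memory `queue.Queue` standing in for the transport between devices are all hypothetical.

```python
import queue


class ControlPlaneProxy:
    """Hypothetical host-side proxy: runs the real IO stack (here, a dict of files)."""

    def __init__(self, files):
        self.files = files  # stand-in for the host file system

    def handle(self, request):
        # The proxy performs the actual IO on behalf of the co-processor.
        if request["op"] == "read":
            data = self.files[request["path"]]
            start = request["offset"]
            return data[start:start + request["length"]]
        raise ValueError("unsupported op: " + request["op"])


class DataPlaneStub:
    """Hypothetical co-processor-side stub: offers an IO call, delegates the work."""

    def __init__(self, proxy):
        self.proxy = proxy
        self.channel = queue.Queue()  # stand-in for the transport between devices

    def read(self, path, offset, length):
        # The stub packages the call as a message instead of running an IO stack.
        self.channel.put(
            {"op": "read", "path": path, "offset": offset, "length": length}
        )
        # The proxy picks the message up and runs the real operation.
        return self.proxy.handle(self.channel.get())


proxy = ControlPlaneProxy({"/data/img.bin": b"0123456789"})
stub = DataPlaneStub(proxy)
result = stub.read("/data/img.bin", offset=2, length=4)
print(result)
```

The application on the co-processor only sees `stub.read`; everything that touches the file system lives in the proxy, which is the property the slide describes as "OS abstraction with minimal effort".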

28 Operating System Services
1 Transport service
2 Filesystem service
3 Network service

31 Transport Service
High-performance data transfer among devices is challenging:
- Uniform data transfer among devices
- High contention in massively-parallel co-processor
- Asymmetric performance between host processor and co-processor
Our approach
- Uniform data transfer: system-mapped PCIe window
- High contention: combining, replication, interleaving, etc.
- Asymmetric performance: flexibly configurable (host DMA engine vs. co-processor DMA engine)
See details in the paper

36 Filesystem Service
Peer-to-peer operation
[Figure: co-processor (file system stub), host processor (file system proxy, file system), and SSD (DMA engine) on PCIe, connected by control and data paths, with steps numbered 1 to 5]
- Zero-copy of data between co-processor memory and SSD
- Minimal data transfer

41 Filesystem Service
Buffered operation
[Figure: co-processor (file system stub), host processor (file system proxy, buffer cache, file system), and SSD (DMA engine) on PCIe, connected by control and data paths, with steps numbered 1 to 6]
- Reduce storage IO by leveraging shared buffer cache among co-processors
- Avoid performance anomaly of peer-to-peer communication over PCIe bus
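
The difference between the two filesystem modes can be illustrated with a small simulation. This is a sketch under stated assumptions rather than the Solros code: the `FileSystemProxy` class, the `buffered` flag, and the block-number interface are hypothetical, and how the real proxy decides between the two modes is not stated on the slides.

```python
class FileSystemProxy:
    """Hypothetical host-side proxy that serves reads in one of two modes."""

    def __init__(self, ssd_blocks):
        self.ssd_blocks = ssd_blocks  # block number -> bytes, stand-in for the SSD
        self.buffer_cache = {}        # host buffer cache shared by all co-processors
        self.ssd_reads = 0            # counts storage IOs actually issued

    def _read_ssd(self, block):
        self.ssd_reads += 1
        return self.ssd_blocks[block]

    def read(self, block, buffered):
        if buffered:
            # Buffered operation: fill the shared cache once, then serve
            # later requests from host memory.
            if block not in self.buffer_cache:
                self.buffer_cache[block] = self._read_ssd(block)
            return "buffered", self.buffer_cache[block]
        # Peer-to-peer operation: data moves from the SSD to co-processor
        # memory without being staged in host memory.
        return "p2p", self._read_ssd(block)


proxy = FileSystemProxy({0: b"aaaa", 1: b"bbbb"})

# Two co-processors read the same block in buffered mode: one storage IO.
proxy.read(0, buffered=True)
proxy.read(0, buffered=True)
buffered_ios = proxy.ssd_reads

# The same two reads in peer-to-peer mode each go to the SSD.
proxy.read(1, buffered=False)
proxy.read(1, buffered=False)
p2p_ios = proxy.ssd_reads - buffered_ios

print(buffered_ios, p2p_ios)
```

In this toy model the shared cache halves the storage IO for data read by more than one co-processor, while the peer-to-peer path avoids the extra copy through host memory.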

42 Implementation
- Host: 2-socket Xeon processor (12 cores each)
- Co-processor: 4 Xeon Phi (KNC, 61 cores, Linux, PCIe Gen 3x16)
- Storage device: 4 NVMe SSD
- NIC: 100 Gbps Ethernet
[Table: lines of code per module, with columns for added lines and deleted lines. Rows: transport service; file system service (stub, proxy); network service (stub, proxy); NVMe device driver; SCIF kernel module; total. Most cell values are truncated in this copy. Surviving values: stub 5,957 added and 2,073 deleted; total 18,844 added and 2,714 deleted.]

43 Evaluation
Questions:
- Performance of Solros services
- Impact on real-world applications

44 Performance of Solros Services
[Figure: (a) file random read on SSD, throughput in GB/sec by request size (ticks include 32KB through 2MB), Phi-Solros vs. Phi-Linux; (b) TCP latency for 64-byte messages, percentage of requests (%) against latency (usec), Phi-Solros vs. Phi-Linux]
- File IO performance: 19x faster than the stock Linux on Xeon Phi
- TCP latency (99th percentile): 7x shorter than the stock Linux on Xeon Phi

45 Performance of Solros Services
[Figure: breakdown for Phi-Linux vs. Phi-Solros; (a) file random read on SSD, split into file system, block/transport, and storage; (b) TCP latency for 64-byte messages, split into network stack and proxy/transport; axis label: Latency (msec)]
- Significant performance gain in data transport
- Running IO stack on co-processor is slower

46 Real-world - Image Search
- Image search engine is running on Xeon Phi
- Image database is on NVMe SSD (shared read-only)
- Image search queries are from network
[Figure: elapsed time (sec) and bandwidth (MB/s) over time (sec), Phi-Solros vs. Phi-Linux]
- Solros performs 2x faster than stock Linux on Xeon Phi

47 Conclusion
Solros, a new operating system architecture for co-processors and fast IO devices
The control-plane/data-plane architecture allows:
- Supporting high-level OS abstraction on co-processor
- Efficient global coordination among devices
- Near ideal IO performance from co-processor
We will release source code soon

48 Image Search - Scalability
- Increase the number of Xeon Phi to 4
[Figure: normalized speedup against the number of Xeon Phi co-processors]
- Solros load balancing mechanism achieves near linear scaling

49 Related work
- Control-plane/data-plane OS: Arrakis [OSDI 14], IX [OSDI 14]
- OS for heterogeneous systems: Helios [SOSP 09], M3 [ASPLOS 16], Hydra [ASPLOS 08]
- IO support for GPU: PTask [SOSP 11], GPUfs [ASPLOS 13], GPUnet [OSDI 14]

50 Transport service
[Figure: three address spaces. Host's virtual address space: App 1, App 2. Host's physical address space: host RAM and a PCIe window, mapped in-host. Device's physical address space: Xeon Phi (mmio, DRAM), NIC (mmio, DRAM), NVMe (mmio), mapped across devices.]

51 Network Service (TCP)
Outbound operation / Inbound operation
[Figure: co-processor (TCP stub), host processor (TCP proxy, load balancer, TCP stack), and NIC on PCIe, connected by control and data paths, with steps numbered 1 to 2]
- A load balancer on the host distributes incoming TCP connections to one of the least-loaded co-processors.
See details in the paper.
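
Least-loaded dispatch of incoming connections can be sketched as follows. This is an illustrative assumption, not the Solros load balancer: the slide does not say how load is measured, so the sketch uses the number of active connections per co-processor, and the `LoadBalancer` class and the `phi0` to `phi3` identifiers are hypothetical.

```python
class LoadBalancer:
    """Hypothetical host-side balancer; load = active connections (an assumption)."""

    def __init__(self, coprocessor_ids):
        self.active = {cid: 0 for cid in coprocessor_ids}

    def assign(self):
        # Pick the co-processor with the fewest active connections;
        # ties are broken by the lowest identifier.
        target = min(self.active, key=lambda cid: (self.active[cid], cid))
        self.active[target] += 1
        return target

    def close(self, cid):
        # A finished connection frees capacity on its co-processor.
        self.active[cid] -= 1


lb = LoadBalancer(["phi0", "phi1", "phi2", "phi3"])
assigned = [lb.assign() for _ in range(5)]  # five incoming connections
lb.close("phi2")                            # one connection on phi2 ends
next_target = lb.assign()                   # phi2 is now the least loaded
print(assigned, next_target)
```

Because the balancer runs on the host, it sees the load of every co-processor at once, which is the kind of system-wide information the control-plane OS is meant to hold.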

52 Discussion
Hardware support other than Xeon Phi
- Two atomic instructions: transport service
- MMU: isolation among co-processor applications
Scalability of control-plane OS
- Limited by scalability of OS service, PCIe interconnect, and performance of IO devices

53 Real-world - Text Search
- CLucene text indexing engine running on Xeon Phi
- Text data is on NVMe SSD
[Figure: elapsed time (sec) and bandwidth (GB/s) over time (sec), Phi-Solros vs. Phi-Linux (virtio)]
- Solros performs 19x faster than stock Linux (ext4/virtio) on Xeon Phi
