Optimized Distributed Data Sharing Substrate in Multi-Core Commodity Clusters: A Comprehensive Study with Applications

1 Optimized Distributed Data Sharing Substrate in Multi-Core Commodity Clusters: A Comprehensive Study with Applications
K. Vaidyanathan, P. Lai, S. Narravula and D. K. Panda
Network Based Computing Laboratory (NBCL), The Ohio State University

2 Presentation Outline
- Introduction and Motivation
- Distributed Data Sharing Substrate
- Proposed Design Optimizations
- Experimental Results
- Conclusions and Future Work

3 Introduction and Motivation
- Interactive data-driven applications, such as stock trading, airline tickets, medical imaging, online auction, online banking and web streaming
  - Ability to interact with, synthesize and visualize data
- Datacenters enable such capabilities
  - They process data and reply to client queries
  - They are common and increasing in size (IBM, Amazon, Google)
- Datacenters are unable to meet increasing client demands
[Figure labels: Stock markets, Airline industries, Medical imaging, Online auction]

4 Datacenter Architecture
[Figure labels: Clients; WAN; Proxy/Web Server (Apache, STORM); Application Server (PHP, CGI); Database Server (MySQL, DB2); Storage; Tier 0, Tier 1, Tier 2; More Computation and Communication Requirements]
- Applications host web content online
- Services improve performance and scalability
  - Resource monitoring (Ganglia), resource management (IBM WebSphere), caching
- State sharing is common in applications and services
  - Communicate and synchronize (intra-node, intra-tier and inter-tier)

5 State Sharing in Datacenters
[Figure labels: Proxy Server, Application Server; Apache, STORM (A and B), Servlets, App, Res Mgmt; Resource monitoring, Resource adaptation, Load balancing, Caching; System state, System load, Data; Network, IPC, Memory copies; Tier 0, Tier 1; Intra-Node, Intra-Tier and Inter-Tier State Sharing]

6 State Sharing in Datacenters
- Several applications employ their own data management protocols
  - Maintain versions of stored data
  - Synchronization primitives
- Datacenter services frequently exchange
  - System load, system state, locks
  - Cached data
- Issues
  - Ad-hoc messaging protocols for exchanging data/resource
  - Same data/resource at multiple places (e.g., load information, data)
  - Protocols used are typically TCP/IP, IPC mechanisms, memory copies, etc.
  - Performance may depend on the back-end load
  - Scalability issues

7 High-Performance Networks
- InfiniBand, 10 Gigabit Ethernet
- High performance
  - Low latency (< 1 usec) and high bandwidth (> 32 Gbps with QDR adapters)
- Novel features
  - One-sided RDMA and atomics, multicast, QoS
- OpenFabrics Alliance
  - Common stack for several networks, including iWARP (LAN/WAN)

8 Datacenter Research at OSU
[Figure: layered architecture, top to bottom]
- Existing datacenter components
- Advanced system services: active resource adaptation, reconfiguration, resource monitoring; dynamic content caching, active caching, cooperative caching; QoS & admission control
- Advanced service primitives (Distributed Data/Resource Sharing Substrate): global memory aggregator, soft shared state, lock manager
- Advanced communication protocols and subsystems: Sockets Direct Protocol; RDMA, atomics, multicast
- High-speed networks: high-performance networks (InfiniBand, iWARP 10GigE)
Datacenter Homepage:

9 Distributed Data Sharing Substrate
[Figure: datacenter applications and datacenter services issue Get and Put operations on shared Load Info, System State, Meta-data and Data.]
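
The Get/Put interface shown in the figure can be sketched as below. This is a minimal single-process stand-in, not the DDSS implementation: the `SharedState` class, its method signatures, and the per-key version counter are hypothetical names chosen for illustration.

```python
import threading


class SharedState:
    """Hypothetical in-process stand-in for a shared-state store."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}  # key -> (version, value)

    def put(self, key, value):
        """Store value under key and return the new version number."""
        with self._lock:
            version = self._data.get(key, (0, None))[0] + 1
            self._data[key] = (version, value)
            return version

    def get(self, key):
        """Return (version, value), or (0, None) if the key is absent."""
        with self._lock:
            return self._data.get(key, (0, None))


# Two updates to the same (hypothetical) load entry, then one read.
store = SharedState()
store.put("load:node3", 0.42)
store.put("load:node3", 0.57)
version, value = store.get("load:node3")
```

The version counter is one simple way for readers to detect that an entry changed between two reads; the slides do not say how the substrate itself tracks versions.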

10 Multicore Architectures
- Increased cores per chip
  - More parallelism available
- Intel, AMD
  - Dual-core, quad-core
  - 80-core systems are currently being built
- Significant benefits for datacenters
  - Applications are multi-threaded in nature
- Design optimizations in state sharing mechanisms
  - Opportunities for dedicating one or more cores
  - Future multicore systems

11 Objective
- Can we enhance the distributed data sharing substrate using the features of multicore architectures by dedicating one or more of the cores?
- How do these enhancements improve the overall performance of datacenter applications and services?

12 Presentation Outline
- Introduction and Motivation
- Distributed Data Sharing Substrate
- Proposed Design Optimizations
- Experimental Results
- Conclusions and Future Work

13 Distributed Data Sharing Substrate
- Use of a common service thread to get access to the shared state
  - Applications get shared state information using the service thread
- Several design optimizations in communicating with the service thread
  - Message queues (MQ-DDSS)
  - Memory-mapped queues for requests (RMQ-DDSS)
  - Memory-mapped queues for requests and responses (RCQ-DDSS)

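14 Message Queue-based DDSS (MQ-DDSS)
[Figure: application threads produce requests with IPC_Send, and the service thread consumes them with IPC_Recv from a request queue. The service thread produces completions with IPC_Send, and application threads consume them with IPC_Recv from a completion queue. Both queues are kernel message queues, so each operation crosses from user space into kernel space. The service thread also consumes NIC events, which are delivered through interrupts and a kernel thread.]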

15 Message Queue-based DDSS
- Kernel involvement
  - IPC send and receive operations
  - Communication progress
- Limitations
  - Several context switches
  - Interrupt overheads
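
The request/completion flow through a service thread can be sketched as follows. This is an illustration under stated assumptions: Python's `queue.Queue` stands in for the kernel message queues (its blocking `get` plays the role of IPC_Recv), each request carries its own completion queue as a simplification, and the names `service_thread` and `call` are hypothetical.

```python
import queue
import threading

request_q = queue.Queue()   # stand-in for the kernel request message queue
shared_state = {}           # state touched only by the service thread


def service_thread():
    """Consume requests, apply them to the shared state, post completions."""
    while True:
        op, key, value, completion_q = request_q.get()  # blocks, like IPC_Recv
        if op == "stop":
            break
        if op == "put":
            shared_state[key] = value
            completion_q.put(("ok", None))               # like IPC_Send
        elif op == "get":
            completion_q.put(("ok", shared_state.get(key)))


def call(op, key, value=None):
    """Application side: send one request and block until its completion."""
    completion_q = queue.Queue(maxsize=1)
    request_q.put((op, key, value, completion_q))
    return completion_q.get()


worker = threading.Thread(target=service_thread)
worker.start()
call("put", "system_load", 0.75)
status, load = call("get", "system_load")
request_q.put(("stop", None, None, None))  # shut the service thread down
worker.join()
```

Every `call` blocks twice, once in each direction. In the kernel-queue design each of those blocking points implies a context switch, which is the overhead the limitations above refer to.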

16 Presentation Outline
- Introduction and Motivation
- Distributed Data Sharing Substrate
- Proposed Design Optimizations
- Experimental Results
- Conclusions and Future Work

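17 Request/Message Queue-based DDSS (RMQ-DDSS)
[Figure: application threads produce requests into a request queue in user space, and the service thread consumes them. Completions still return through kernel message queues: the service thread produces them with IPC_Send and application threads receive them with IPC_Recv, so kernel involvement remains on the completion path. The service thread also consumes from the NIC.]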

18 Request/Completion Queue-based DDSS (RCQ-DDSS)
[Figure: application threads produce requests into a request queue and consume responses from a completion queue. The service thread consumes requests, produces completions, and consumes from the NIC. Both queues are in user space, with no kernel involvement.]

19 RMQ-DDSS and RCQ-DDSS Schemes
- RMQ-DDSS scheme
  + Fewer interrupts and context switches than MQ-DDSS
  + Improvement in response time, as the request is sent via memory-mapped queues
  - May occupy significant CPU
- RCQ-DDSS scheme
  + Avoids kernel involvement
  + Significant improvement in response time, as request and response are sent via memory-mapped queues
  - May occupy more CPU than RMQ-DDSS, since applications and the service thread need to poll on the completion queue
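
A polled queue in mapped memory, the building block behind the RMQ and RCQ schemes, can be sketched as below. This is a minimal single-producer/single-consumer ring over an anonymous `mmap` region. The `MappedQueue` class, slot layout, and sizes are assumptions for illustration and are not taken from the DDSS code.

```python
import mmap
import struct

SLOT_SIZE = 64                 # bytes per slot: 4-byte length prefix + payload
NUM_SLOTS = 8                  # one slot stays empty to tell full from empty
HEADER = struct.Struct("<II")  # head (next slot to read), tail (next to write)


class MappedQueue:
    """Hypothetical single-producer/single-consumer ring in mapped memory."""

    def __init__(self):
        self.buf = mmap.mmap(-1, HEADER.size + SLOT_SIZE * NUM_SLOTS)
        HEADER.pack_into(self.buf, 0, 0, 0)

    def try_produce(self, payload: bytes) -> bool:
        head, tail = HEADER.unpack_from(self.buf, 0)
        if (tail + 1) % NUM_SLOTS == head or len(payload) > SLOT_SIZE - 4:
            return False                       # queue full or payload too large
        offset = HEADER.size + tail * SLOT_SIZE
        struct.pack_into("<I", self.buf, offset, len(payload))
        self.buf[offset + 4:offset + 4 + len(payload)] = payload
        # Publish the slot by advancing tail (stored at byte offset 4).
        struct.pack_into("<I", self.buf, 4, (tail + 1) % NUM_SLOTS)
        return True

    def try_consume(self):
        head, tail = HEADER.unpack_from(self.buf, 0)
        if head == tail:
            return None                        # empty: the caller polls again
        offset = HEADER.size + head * SLOT_SIZE
        (length,) = struct.unpack_from("<I", self.buf, offset)
        payload = bytes(self.buf[offset + 4:offset + 4 + length])
        # Release the slot by advancing head (stored at byte offset 0).
        struct.pack_into("<I", self.buf, 0, (head + 1) % NUM_SLOTS)
        return payload


requests = MappedQueue()
requests.try_produce(b"get system_load")
message = requests.try_consume()
```

Neither side ever blocks in the kernel: an empty or full queue simply returns, and the caller retries. That retry loop is the polling cost listed above. A real cross-process version would map a shared file or shared-memory object and would need explicit memory-ordering guarantees, which this single-process sketch does not address.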

20 Presentation Outline
- Introduction and Motivation
- Distributed Data Sharing Substrate
- Proposed Design Optimizations
- Experimental Results
- Conclusions and Future Work

21 Experimental Testbed
- InfiniBand experiments
  - 560-core cluster consisting of 70 compute nodes with dual 2.33 GHz Intel Xeon quad-core processors
  - Mellanox MT25208 dual-port HCA
- 10-Gigabit Ethernet experiments
  - Intel dual quad-core Xeon 3.0 GHz, 512 MB memory
  - Chelsio T3B 10 GigE PCI-Express adapters
- OpenFabrics stack: OFED 1.2
- Experimental outline
  - Microbenchmarks (performance and scalability)
  - Application performance (R-Trees, B-Trees, STORM, checkpointing)
  - Dedicating cores for datacenter services (resource monitoring)

22 Basic Performance of DDSS
- RCQ-DDSS scales with increasing client threads
- RCQ-DDSS performs better than RMQ-DDSS and MQ-DDSS
[Figure: three plots comparing RCQ-DDSS, RMQ-DDSS and MQ-DDSS: IPC latency (usecs) vs. number of client threads; latency (usecs) vs. message size (bytes) on InfiniBand; latency (usecs) vs. message size (bytes) on 10-Gigabit Ethernet.]

23 DDSS Scalability
- Hybrid approach is required for scalability with large number of threads
- DDSS scales when keys are distributed
[Figure: three plots comparing RCQ-DDSS, RMQ-DDSS and MQ-DDSS against number of client threads: IPC latency (usecs); latency (usecs) when keys are on a single node; latency (usecs) when keys are distributed.]
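
One way keys could be spread across nodes is sketched below. The slides do not describe the placement rule DDSS uses, so the stable-hash rule, the `home_node` function, and the node names are all assumptions made for illustration.

```python
import hashlib
from collections import Counter

NODES = ["node0", "node1", "node2", "node3"]  # hypothetical node names


def home_node(key: str, nodes=NODES) -> str:
    """Map a key to a node with a stable hash (illustrative placement rule)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


# Count how many of 1000 synthetic keys land on each node.
placement = Counter(home_node(f"key{i}") for i in range(1000))
```

When every key lives on one node, all client threads contend for that node's service thread and NIC. Spreading keys divides the same request stream across several nodes.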

24 Performance with R-Trees, B-Trees, STORM
- MQ-SS shows significant improvement compared to traditional implementations, but RCQ-SS shows marginal improvements compared to MQ-SS
[Figure: three plots. Time (msecs) vs. records accessed (up to 100%) for RTREE-RCQ-SS, RTREE-RMQ-SS, RTREE-MQ-SS and RTREE; time (msecs) vs. records accessed (20% to 100%) for BTREE-RCQ-SS, BTREE-RMQ-SS, BTREE-MQ-SS and BTREE; time (msecs) vs. number of records (up to 1000K) for STORM-RCQ-SS, STORM-RMQ-SS, STORM-MQ-SS and STORM.]

25 Data Sharing Performance in Applications
- RCQ-DDSS shows significant improvement as compared to RMQ-DDSS and MQ-DDSS
[Figure: three plots. Time (usecs) vs. records accessed (up to 100%) for RTREE-RCQ-DDSS, RTREE-RMQ-DDSS, RTREE-MQ-DDSS and RTREE; time (usecs) vs. records accessed (20% to 100%) for BTREE-RCQ-DDSS, BTREE-RMQ-DDSS, BTREE-MQ-DDSS and BTREE; time (milliseconds) vs. number of records (up to 1000K) for STORM-RCQ-DDSS, STORM-RMQ-DDSS, STORM-MQ-DDSS and STORM.]

26 Performance with Checkpointing
- Hybrid approach is required for scalability with large number of threads
[Figure: three plots comparing RCQ-DDSS, RMQ-DDSS and MQ-DDSS against number of client threads: execution time (usecs) with clients on a single node (non-distributed); execution time (usecs) with clients on different nodes (non-distributed); latency (usecs) with clients on different nodes (non-distributed).]

27 Performance with Dedicated Cores
- Dedicating a core for resource monitoring can avoid up to 50% degradation in client response time
[Figure: two plots of latency (microseconds) vs. iterations, one series per server count (4Servers, 16Servers and 32Servers are legible; the remaining series label is lost).]
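
Dedicating a core to a monitoring task can be sketched with CPU affinity, as below. This is an illustration rather than the mechanism used in the experiments: `os.sched_setaffinity` is available on Linux only, the helper names are hypothetical, and sampling `time.process_time()` is a placeholder for real resource monitoring.

```python
import os
import threading
import time


def pin_current_thread_to(cpu: int) -> bool:
    """Pin the calling thread to one core; return False where unsupported."""
    if not hasattr(os, "sched_setaffinity"):
        return False                    # e.g. macOS or Windows
    try:
        os.sched_setaffinity(0, {cpu})  # on Linux, 0 targets the calling thread
    except OSError:
        return False                    # the core is not available to us
    return True


def monitor(cpu: int, samples: list, rounds: int = 3):
    """Hypothetical monitoring loop that runs on its dedicated core."""
    pinned = pin_current_thread_to(cpu)
    for _ in range(rounds):
        samples.append((pinned, time.process_time()))  # placeholder sample
        time.sleep(0.01)


# Use the last core this process is allowed to run on as the dedicated one.
allowed = sorted(os.sched_getaffinity(0)) if hasattr(os, "sched_getaffinity") else [0]
samples = []
monitor_thread = threading.Thread(target=monitor, args=(allowed[-1], samples))
monitor_thread.start()
monitor_thread.join()
```

Pinning the monitor is only half of dedicating a core. Application threads would also need to be kept off that core, for example by restricting their affinity to the remaining ones.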

28 Conclusions & Future Work
- Proposed multicore optimizations for the distributed data sharing substrate
- Evaluations with several applications show significant improvement
- Showed the benefits of dedicating cores for services in datacenters
- Future work: dedicating cores to other datacenter services and datacenter-specific operations

29 Web Pointers
NBC-LAB Datacenter Homepage: http://nowlab.cse.ohio-state.
E-mails: {vaidyana, laipi, narravul,
