Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services P. Balaji, K. Vaidyanathan, S. Narravula, H. W. Jin and D. K. Panda Network Based Computing Laboratory (NBCL) Computer Science and Engineering Ohio State University

Introduction and Motivation Interactive Data-driven Applications Scientific as well as Enterprise/Commercial Applications Static Datasets: Medical Imaging Modalities Dynamic Datasets: Stock value datasets, E-commerce, Sensors E-science Ability to interact with, synthesize and visualize large datasets Data-centers enable such capabilities Clients initiate queries (over the web) to process specific datasets Data-centers process data and reply to queries 04/26/06 D. K. Panda (The Ohio State University)

Typical Multi-Tier Data-center Environment Clients WAN WAN Proxy Server Web-server (Apache) More Computation and Communication Requirements Storage Application Server (PHP) Database Server (MySQL) Requests are received from clients over the WAN Proxy nodes perform caching, load balancing, resource monitoring, etc. If not cached, the request is forwarded to the next tiers Application Server Application server performs the business logic (CGI, Java servlets, etc.) Retrieves appropriate data from the database to process the requests

Limitations of Current Data-centers Communication Requirements TCP/IP used even in the data-center: Sub-optimal performance InfiniBand and other interconnects provide more features High Performance Sockets (e.g., SDP) Superior performance with no modifications Advanced Data-center Services Minimize the computation requirements Improved caching of documents Issues with caching Dynamic (or Active) Content Maximize compute resource utilization Efficient resource monitoring and management Issues with heterogeneous load characteristics of data-centers

Proposed Architecture Existing Data-Center Components Dynamic Content Caching Active Resource Adaptation Advanced System Services Soft Shared State Point To Point Distributed Lock Manager Global Memory Aggregator Data-Center Service Primitives Sockets Direct Protocol Packetized Flow-control Async. Zero-copy Communication Advanced Communication Protocols and Subsystems Protocol Offload RDMA Atomic Multicast Network

Presentation Layout Introduction and Motivation Advanced Communication Protocols and Subsystems Data-center Service Primitives Dynamic Content Caching Services Active Resource Adaptation Services Conclusions and Ongoing Work

The Sockets Protocol Stack Application App #1 App #2 App #N Sockets Interface Sockets Interface Traditional Sockets TCP Traditional Sockets TCP High Performance Sockets (e.g., SDP) IP Device Driver IP Device Driver Lower-level Interface High-speed Network Berkeley Sockets Implementation Advanced Features High-speed Network Offloaded Protocol The Sockets Protocol Stack allows applications to utilize the network performance and capabilities with NO or MINIMAL modifications

InfiniBand and Features An emerging open standard high performance interconnect High Performance Data Transfer Interprocessor communication and I/O Low latency (~1.0-3.0 microsec), High bandwidth (~10-20 Gbps) and low CPU utilization (5-10%) Flexibility for WAN communication Multiple Operations Send/Recv RDMA Read/Write Atomic Operations (very unique) high performance and scalable implementations of distributed locks, semaphores, collective communication operations Range of Network Features and QoS Mechanisms Service Levels (priorities) Virtual lanes Partitioning Multicast allows to design a new generation of scalable communication and I/O subsystem with QoS

SDP Latency and Bandwidth 70 Latency 60 900 Unidirectional Bandwidth 160 60 50 800 140 Latency (usec) 50 40 30 20 40 30 20 CPU Utilization % Bandwidth (Mpbs) 700 600 500 400 300 120 100 80 60 CPU Utilization % 10 10 200 100 40 20 0 2 4 8 16 32 64 128 256 512 1K 2K 4K 0 0 4 16 64 256 1K 4K 16K 64K 0 Message Size (Bytes) Message Size (Bytes) TCP/IP CPU TCP/IP Native IBA SDP CPU SDP TCP/IP CPU TCP/IP Native IBA SDP CPU SDP Sockets Direct Protocol over InfiniBand in Clusters: Is it Beneficial?, P. Balaji, S. Narravula, K. Vaidyanathan, K. Savitha, D. K. Panda. IEEE International Symposium on Performance Analysis and Systems (ISPASS), 04.

Zero-Copy Communication for Sockets Buffer 1 Application Blocks Buffer 2 Application Blocks Sender Send Send Complete Send Send Complete SRC AVAIL Get Data GET COMPLETE SRC AVAIL Get Data GET COMPLETE Receiver Buffer 1 Buffer 2

Asynchronous Zero-Copy SDP Buffer 1 Buffer 2 Sender Send Memory Protect Send Memory Protect SRC AVAIL Receiver Get Data Memory Unprotect Memory Unprotect GET COMPLETE Buffer 1 Buffer 2

Throughput and Comp./Comm. Overlap 12000 Throughput 10000 Comp./Comm. Overlap Throughput (Mbps) 10000 8000 6000 4000 BSDP ZSDP AZ-SDP Throughput (Mbps) 9000 8000 7000 6000 5000 4000 3000 BSDP ZSDP AZSDP 2000 2000 1000 0 1 4 16 64 256 1K 4K 16K Message Size (Bytes) 64K 256K 1M 0 0 20 40 60 80 100 Delay (usec) 120 140 160 180 200 Asynchronous Zero-copy Communication for Synchronous Sockets in the Sockets Direct Protocol (SDP) over InfiniBand. P. Balaji, S. Bhagvat, H. W. Jin and D. K. Panda. Workshop on Communication Architecture for Clusters (CAC); with IPDPS 06.

Data-Center Service Primitives Common Services needed by Data-Centers Better resource management Higher performance provided to higher layers Service Primitives Soft Shared State Distributed Lock Management Global Memory Aggregator Network Based Designs RDMA, Remote Atomic Operations

Soft Shared State Data-Center Application Get Put Data-Center Application Data-Center Application Get Shared State Put Data-Center Application Get Put Data-Center Application Data-Center Application

Dynamic data caching challenging! Cache Consistency and Coherence Become more important than in static case Active Caching Proxy Nodes Back-End Nodes User Requests Update

Active Cache Design Efficient mechanisms needed RDMA based design Load resiliency Our cooperation protocols No-Dependency Invalidate-All Client Polling based design

RDMA based Client Polling Design Front-End Back-End Request Response Cache Hit Version Read Response Cache Miss

Active Caching - Performance 16000 Data-Center Throughput 16000 Effect of Load Throughput 14000 12000 10000 8000 6000 No Cache Invalidate All Dependency Lists Throughput 14000 12000 10000 8000 6000 No Cache Dependency Lists 4000 4000 2000 2000 0 0 Trace 2 Trace 3 Trace 4 Trace 5 0 1 2 4 8 16 32 64 Traces with Increasing Update Rate Load (Compute Threads) Higher overall performance Up to an order of magnitude Performance is sustained under loaded conditions Architecture for Caching Responses with Multiple Dynamic Dependencies in Multi-Tier Data-Centers over InfiniBand. S. Narravula, P. Balaji, K. Vaidyanathan, H. -W. Jin and D. K. Panda. CCGrid-2005

Multi-tier Cooperative Caching RDMA based schemes Effective use of system-wide memory from across multiple tiers Significant performance benefits Our Schemes BCC, CCWR, MTACC and HYBCC Up to 2-3 times compared to the base case Improvement Ra Performance Improvement 3 2.5 2 1.5 1 0.5 0 BCC CCWR MTACC HYBCC 8k 16k 32k 64k S. Narravula, H. -W. Jin, K. Vaidyanathan and D. K. Panda, Designing Efficient Cooperative Caching Schemes for Multi-Tier Data-Centers over RDMA-enabled Networks. IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid 06).

Active Resource Adaptation Increasing popularity of Shared data-centers How to decide the number of proxy nodes vs. application servers vs. database servers Current approach Use a rigid configuration Over-Provisioning Active Resource Adaptation Reconfigure nodes from one tier to another tier Allocate resources based on system load and traffic pattern Meet QoS and Prioritization constraints Load Resiliency

Active Resource Adaptation in Shared Data- Centers Load Balancing Cluster (Site A) Servers Website A (low priority) Clients Clients WAN Load Balancing Cluster (Site B) Hard QoS Maintained Servers Website B (medium priority) Load Balancing Cluster (Site C) Servers Website C (high priority) Reconf-PQ reconfigures nodes for different websites but also guarantees fixed number of nodes to low priority requests

Active Resource Adaptation Design Server Website A Load Balancer Server Website B Not Loaded Load Query RDMA RDMA Loaded Load Query Successful Atomic (Lock) Successful Atomic (Update Counter) Reconfigure Node Successful Atomic (Unlock) Load Shared Load Shared

Dynamic Reconfigurability using RDMA operations 60000 Throughput 100% QoS Meeting Capability TPS 50000 40000 30000 20000 10000 % of QoS Met 80% 60% 40% 20% 0 1K 2K 4K 8K 16K 0% Case 1 Case 2 Case 3 Rigid Reconf Over-Provisioning Reconf Reconf-P Reconf-PQ On the Provision of Prioritization and Soft QoS in Dynamically Reconfigurable Shared Data- Centers over InfiniBand. `P. Balaji, S. Narravula, K. Vaidyanathan, H. W. Jin and D. K. Panda. IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) 05.

Conclusions Proposed a novel framework for data-centers to address the current limitations Low performance due to high communication overheads Lack of efficient support of advanced features such as active caching, dynamic resource adaptation, etc Three-layer Architecture Communication Protocol Support Data-Center Primitives Data-Center Services Novel approaches using the advanced features of InfiniBand Resilient to the load on the back-end servers Order of magnitude performance gain for several scenarios

Work-in-Progress Data-Center Primitives Efficient System-Wide Soft Shared State Mechanisms Efficient Distributed Lock Manager Mechanisms Fine-Grained Active Resource Adaptation Fine-grain resource monitoring Resource adaptation with database servers and multi-stage reconfigurations Detailed Data-Center Evaluation with the proposed framework

Web Pointers NBCL Website: http://www.cse.ohio-state.edu/~panda Group Homepage: http://nowlab.cse.ohio-state.edu Email: panda@cse.ohio-state.edu