Advanced RDMA-based Admission Control for Modern Data-Centers

Advanced RDMA-based Admission Control for Modern Data-Centers
Ping Lai, Sundeep Narravula, Karthikeyan Vaidyanathan, Dhabaleswar K. Panda
Computer Science & Engineering Department, The Ohio State University

Outline
- Introduction & Motivation
- Proposed Design
- Experimental Results
- Conclusions & Future Work

Introduction
- The Internet keeps growing: more users, more types of services, huge amounts of data
- Typical applications: e-commerce, bio-informatics, online banking, etc.
- Web-based multi-tier data-centers
- Huge bursts of requests overload the servers
- Clients pay for the service and expect guaranteed QoS
- Efficient admission control is needed!
[Figure: multi-tier data-center. Clients connect over the WAN to proxy (Apache) web servers (Tier 0), application (PHP) servers (Tier 1), and database (MySQL) servers (Tier 2)]

General Admission Control
- What is admission control? Deciding whether to accept or drop incoming requests, while guaranteeing the performance (QoS requirements) of already existing connections when the system is overloaded
- Typical approaches:
  - Internal approach: runs on the overloaded servers themselves
  - External approach: runs on the front-tier nodes. Main advantages: it makes global decisions, it is more transparent to the overloaded servers, and it is easily applicable to any tier
[Figure: admission control placed either on the front-tier servers (external approach) or on the overloaded and back-end servers (internal approach)]

Motivation
- In the external approach, the front-tier proxy servers need to obtain load information from the back-end servers
- Problems with the existing designs:
  - They use TCP/IP, which is coarse-grained and has high overhead; its responsiveness degrades with load
  - Workloads are divergent and unpredictable, which requires fine-grained, low-overhead load information

Opportunity & Objective
- Opportunity: modern high-speed interconnects
  - iWARP/10-Gigabit Ethernet, InfiniBand, Quadrics, etc.
  - High performance: low latency and high bandwidth
  - Novel features: atomic operations, protocol offloading, RDMA operations, etc.
  - RDMA: low latency and no communication overhead on the remote node
- Objective: leverage these advanced features (RDMA operations) to design admission control that is more efficient, has lower overhead, and gives better QoS guarantees

Outline
- Introduction & Motivation
- Proposed Design
- Experimental Results
- Conclusions & Future Work

System Architecture
- Load gathering daemon running on the overloaded web servers
- Load monitoring daemon running on the front-tier proxy servers
- Admission control module running on the front-tier proxy servers, inside Apache, communicating with the monitoring daemon via IPC
- Applicable to any tier
[Figure: clients connect over the WAN to proxy servers running Apache with the admission control module and the load monitoring daemon; the monitoring daemon issues RDMA reads to the load gathering daemons on the overloaded web/application servers (Server 1 through Server N), which in turn sit in front of the back-tier servers]

Load Gathering and Monitoring Daemon
- Load gathering daemon
  - Runs in the background on each of the overloaded servers, with low overhead
  - Gathers instantaneous load information
- Load monitoring daemon
  - Runs on each of the front-tier proxy servers
  - Retrieves load information from all the load gathering daemons
- The communication between them is important! TCP/IP is not good, so what can be used?
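
The slides stop at this level of description, so, as a rough illustration only, here is a minimal C sketch of what the gathering daemon's inner loop could look like: it periodically samples /proc/loadavg into a small structure that would later live in an RDMA-registered buffer. The struct load_info layout, the seqlock-style counter, and the update interval are assumptions made for this sketch, not the authors' implementation.

```c
/* Hypothetical sketch of the gathering loop; the real daemon's data layout
 * and load metric are not specified in the slides. */
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>

struct load_info {            /* lives in an RDMA-registered buffer */
    volatile uint64_t seq;    /* bumped on every update so readers can detect torn reads */
    volatile double   load;   /* instantaneous load metric, e.g. 1-minute loadavg */
};

void gather_loop(struct load_info *buf, useconds_t interval_us)
{
    for (;;) {
        double load = 0.0;
        FILE *f = fopen("/proc/loadavg", "r");   /* cheap sample, no per-request accounting */
        if (f) {
            if (fscanf(f, "%lf", &load) != 1)
                load = 0.0;
            fclose(f);
        }
        buf->seq++;                  /* odd seq = update in progress */
        buf->load = load;
        buf->seq++;                  /* even seq = consistent snapshot */
        usleep(interval_us);         /* e.g. 1000 us, matching the 1 ms update period used later */
    }
}
```

A monitoring daemon that reads this buffer remotely can retry whenever it observes an odd sequence number, so it never acts on a half-written update.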

Gathering and Monitoring Daemon (cont.)
[Figure: on the front-tier server, the monitoring thread issues an RDMA read from its registered memory buffer to the registered buffer maintained by the gathering thread on the overloaded server, while the service threads on both sides run undisturbed]
- Use RDMA read: the monitoring daemon issues an RDMA read to the gathering daemon's buffer
- The buffer must be registered and pinned down before the operation
- The monitoring daemon has to know the memory address of the remote buffer
- Load information can be retrieved at high granularity even under overload, enabling better decisions
- No CPU involvement on the loaded servers, hence low overhead
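
To make the RDMA read step concrete, the following libibverbs sketch shows how a monitoring daemon might pull the remote load buffer. It assumes the queue pair is already connected and that the remote buffer's address and rkey were exchanged out of band during setup; the function name, busy-poll completion handling, and error handling are illustrative, not taken from the paper.

```c
/* Minimal libibverbs sketch: post one RDMA read of the remote load buffer and
 * wait for its completion.  QP setup and the out-of-band exchange of
 * remote_addr/rkey are assumed to have happened already. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int rdma_read_load(struct ibv_qp *qp, struct ibv_cq *cq,
                   void *local_buf, uint32_t len, uint32_t lkey,
                   uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,   /* registered local buffer */
        .length = len,
        .lkey   = lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_READ;   /* one-sided: no remote CPU involvement */
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.wr.rdma.remote_addr = remote_addr;        /* address of the gathering daemon's buffer */
    wr.wr.rdma.rkey        = rkey;

    if (ibv_post_send(qp, &wr, &bad_wr))
        return -1;

    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0)         /* busy-poll for the completion */
        ;
    return (wc.status == IBV_WC_SUCCESS) ? 0 : -1;
}
```

Since the read is serviced entirely by the HCA on the overloaded server, no remote CPU cycles are consumed, which is the low-overhead property this slide emphasizes.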

Admission Control Module
- Uses shared memory to communicate with the load monitoring daemon
- Attaches to Apache: dynamically loadable; traps into Apache's request processing
- New processing procedure:
  - The Apache main thread calls the admission control module after the TCP connection is established
  - The admission control module uses a weighted score to make the decision
  - If all of the back-end servers are overloaded, it calls back into the Apache thread to close the new connection; otherwise, it calls back to resume processing the request
[Figure: flowchart. On a connection request, the admission control module combines the QoS status from QoS control with load information from the load monitor into a weighted score and asks whether the system can afford more requests; if NO, the connection is closed; if YES, the request is processed and forwarded to the web server]
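
The slides do not give the weighted-score formula, so the routine below is only a hypothetical shape of the decision: combine the per-server loads published by the load monitoring daemon (via shared memory) with per-server weights, reject the connection if every back-end server is overloaded, and otherwise admit when the weighted score stays under a threshold. The function name, the weights, and the threshold are invented for illustration.

```c
/* Hypothetical decision routine; the slides only say a "weighted score" is
 * computed from the monitored loads, so the weights and threshold here are
 * illustrative, not the authors' actual policy. */
#include <stddef.h>

#define ACCEPT 1
#define REJECT 0

/* loads[] is the shared-memory region filled in by the load-monitoring daemon,
 * one entry per back-end server; weights[] could reflect server capacity. */
int admit_request(const double *loads, const double *weights,
                  size_t nservers, double threshold)
{
    double score = 0.0, wsum = 0.0;
    size_t overloaded = 0;

    for (size_t i = 0; i < nservers; i++) {
        score += weights[i] * loads[i];
        wsum  += weights[i];
        if (loads[i] > threshold)
            overloaded++;
    }
    /* If every back-end server is overloaded, drop the new connection;
     * otherwise admit when the weighted average load is acceptable. */
    if (overloaded == nservers)
        return REJECT;
    return (score / wsum <= threshold) ? ACCEPT : REJECT;
}
```

In the described design, Apache would call such a routine right after accepting the TCP connection and, based on the return value, either close the connection or resume normal request processing.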

Outline
- Introduction & Motivation
- Proposed Design
- Experimental Results
- Conclusions & Future Work

Experimental Platforms
- 32 compute nodes
  - Dual Intel 64-bit Xeon 3.6 GHz CPUs, 2 GB memory
  - Mellanox MT25208 InfiniBand HCA, OFED 1.2 driver
  - Linux 2.6
- Two-tier data-center consisting of proxy servers and web servers; the web servers are potentially overloaded
- Apache 2.2.4 for both proxy servers and web servers

Experiment Results Outline
- Micro-benchmarks: basic IBA performance
- Data-center level evaluation
  - Single file trace
    - Average response time and aggregate TPS
    - Instant performance analysis
    - QoS analysis
  - Worldcup trace and Zipf trace
    - Worldcup trace: real data from World Cup 1998
    - Zipf trace: workload follows a Zipf-like distribution (the probability of requesting the i-th most popular file is proportional to 1/i^α); a small illustration follows below
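
As a small illustration of the Zipf-like distribution mentioned above (not code from the paper), the helper below computes the normalized probability of requesting the i-th most popular file out of N files for a given α:

```c
/* Illustrative only: probability of requesting the i-th most popular file in a
 * Zipf-like workload, P(i) = (1/i^alpha) / sum_{k=1..N} (1/k^alpha). */
#include <math.h>

double zipf_prob(unsigned i, unsigned nfiles, double alpha)
{
    double norm = 0.0;
    for (unsigned k = 1; k <= nfiles; k++)
        norm += 1.0 / pow((double)k, alpha);
    return (1.0 / pow((double)i, alpha)) / norm;
}
```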

Performance of RDMA Read and IPoIB (TCP/IP over IBA)
[Figure: two plots. Left: latency (usec) of RDMA read and IPoIB vs. message size (1 Byte to 4 KB). Right: latency (usec) of 64-byte RDMA read and IPoIB vs. number of background computation threads (1 to 64)]
- For a 1-byte message, RDMA read latency is 5.2 us vs. 18.9 us for IPoIB
- The improvement from using RDMA increases as the message size increases
- With background computation, IPoIB latency degrades significantly while RDMA read latency stays constant
- The performance of IPoIB depends on the server load, while RDMA does not!

Data-Center Level Evaluation
- Configuration:
  - 4 nodes used as proxy servers, 1 node used as the web server, remaining nodes are clients
  - Load information updated every 1 ms
- Measured the average client-perceived response time (for successful requests) and the aggregate system TPS
- Compared the performance of three systems:
  - AC-RDMA: system with admission control based on RDMA read (the proposed approach)
  - AC-TCP/IP: system with admission control based on TCP/IP
  - NoAC: system without any admission control

Performance with Single File Trace (16 KB)
[Figure: average response time (msec) and aggregate TPS vs. number of clients (80 to 520) for AC-RDMA, AC-TCP/IP and NoAC]
- With 520 clients:
  - NoAC: 192.31 ms
  - AC-TCP/IP: 142.29 ms (26% improvement over NoAC)
  - AC-RDMA: 105.03 ms (26% improvement over AC-TCP/IP, 45% improvement over NoAC)
- The aggregate TPS of AC-RDMA and AC-TCP/IP are comparable
- The systems with admission control achieve higher TPS than the original system

Instant Performance
- Workload: 400 clients
- Instant response time:
  - NoAC: many requests are served with very long response times
  - AC-RDMA: almost no such requests
  - AC-TCP/IP: some requests with long response times
- The instant performance is consistent with the trend in average response time

Instant Performance (cont.)
- Instant drop rate:
  - AC-RDMA: closely reflects the instantaneously changing load on the web servers
  - AC-TCP/IP: longer streaks of continuous drops or acceptances
  - NoAC: many acceptances; some drops caused by timeouts
- AC-RDMA obtains the load information in a timely manner, while AC-TCP/IP sometimes reads stale information because the overloaded servers respond slowly over TCP/IP

QoS Analysis
[Figure: instant QoS status, and average QoS unsatisfaction rate (%) vs. number of clients (360 to 520) for AC-RDMA, AC-TCP/IP and NoAC]
- Instant QoS status: AC-RDMA has a much better capability of satisfying the QoS requirement
- Average QoS unsatisfaction: AC-RDMA is the best; with the same QoS requirement (e.g., response time), AC-RDMA can serve more clients than the other two systems

Performance with Worldcup and Zipf Traces
[Figure: average response time (msec) vs. number of clients (80 to 520) for the World Cup trace and the Zipf trace, comparing AC-RDMA, AC-TCP/IP and NoAC]
- World Cup trace: AC-RDMA performs better, with 17% improvement over AC-TCP/IP and 36% over NoAC
- Zipf trace: AC-RDMA performs better, with 23% improvement over AC-TCP/IP and 42% over NoAC

Outline
- Introduction & Motivation
- Proposed Design
- Experimental Results
- Conclusions & Future Work

Conclusions & Future Work
- Leveraged RDMA read to design an efficient admission control mechanism for multi-tier data-centers
- Implemented the design in a two-tier data-center over InfiniBand
- Evaluated with a single file trace, the World Cup trace, and a Zipf trace
  - AC-RDMA outperforms AC-TCP/IP by up to 28% and NoAC by up to 51%
  - AC-RDMA provides better QoS satisfaction
- Main reasons:
  - Load information is updated in a timely manner
  - No extra overhead is imposed on the already overloaded servers
- Future work: study the scalability of the design, and incorporate other earlier work for an integrated resource management service

Overall Datacenter Framework
[Figure: layered framework. Existing data-center components sit on top of Advanced System Services (active resource adaptation/reconfiguration, soft shared state, resource monitoring, dynamic content caching/active caching), which build on a Distributed Data/Resource Sharing Substrate of Advanced Service Primitives (lock manager, cooperative caching, QoS & admission control, global memory aggregator). These run over Advanced Communication Protocols and Subsystems (Sockets Direct Protocol, RDMA, atomics, multicast) on high-performance, high-speed networks (InfiniBand, iWARP/10GigE)]

Thank You
{laipi, narravul, vaidyana, panda}@cse.ohio-state.edu
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
Data-Center Web Page: http://nowlab.cse.ohio-state.edu/projects/data-centers/index.html