Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

Similar documents
Optimized Distributed Data Sharing Substrate in Multi-Core Commodity Clusters: A Comprehensive Study with Applications

S. Narravula, P. Balaji, K. Vaidyanathan, H.-W. Jin and D. K. Panda. The Ohio State University

Advanced RDMA-based Admission Control for Modern Data-Centers

Designing Efficient Systems Services and Primitives for Next-Generation Data-Centers

Supporting Strong Cache Coherency for Active Caches in Multi-Tier Data-Centers over InfiniBand

Designing Next-Generation Data- Centers with Advanced Communication Protocols and Systems Services. Presented by: Jitong Chen

High Performance Distributed Lock Management Services using Network-based Remote Atomic Operations

Designing High Performance Communication Middleware with Emerging Multi-core Architectures

Asynchronous Zero-copy Communication for Synchronous Sockets in the Sockets Direct Protocol (SDP) over InfiniBand

Dynamic Reconfigurability Support for providing Soft QoS Guarantees in Cluster-based Multi-Tier Data-Centers over InfiniBand

Implementing Efficient and Scalable Flow Control Schemes in MPI over InfiniBand

Can Memory-Less Network Adapters Benefit Next-Generation InfiniBand Systems?

Efficient and Truly Passive MPI-3 RMA Synchronization Using InfiniBand Atomics

Benefits of I/O Acceleration Technology (I/OAT) in Clusters

1/5/2012. Overview of Interconnects. Presentation Outline. Myrinet and Quadrics. Interconnects. Switch-Based Interconnects

Designing High Performance DSM Systems using InfiniBand Features

LiMIC: Support for High-Performance MPI Intra-Node Communication on Linux Cluster

The NE010 iwarp Adapter

Memory Scalability Evaluation of the Next-Generation Intel Bensley Platform with InfiniBand

Performance Analysis and Evaluation of Mellanox ConnectX InfiniBand Architecture with Multi-Core Platforms

Improving Application Performance and Predictability using Multiple Virtual Lanes in Modern Multi-Core InfiniBand Clusters

Workload-driven Analysis of File Systems in Shared Multi-tier Data-Centers over InfiniBand

Study. Dhabaleswar. K. Panda. The Ohio State University HPIDC '09

Memcached Design on High Performance RDMA Capable Interconnects

Message Passing Models and Multicomputer distributed system LECTURE 7

High Throughput WAN Data Transfer with Hadoop-based Storage

Multi-Threaded UPC Runtime for GPU to GPU communication over InfiniBand

High Performance MPI on IBM 12x InfiniBand Architecture

Performance Evaluation of RDMA over IP: A Case Study with Ammasso Gigabit Ethernet NIC

Using RDMA for Lock Management

Designing Power-Aware Collective Communication Algorithms for InfiniBand Clusters

In the multi-core age, How do larger, faster and cheaper and more responsive memory sub-systems affect data management? Dhabaleswar K.

Design Alternatives for Implementing Fence Synchronization in MPI-2 One-Sided Communication for InfiniBand Clusters

Coupling GPUDirect RDMA and InfiniBand Hardware Multicast Technologies for Streaming Applications

RDMA Read Based Rendezvous Protocol for MPI over InfiniBand: Design Alternatives and Benefits

High Performance Distributed Lock Management Services using Network-based Remote Atomic Operations

Sockets Direct Procotol over InfiniBand in Clusters: Is it Beneficial?

Messaging Overview. Introduction. Gen-Z Messaging

IsoStack Highly Efficient Network Processing on Dedicated Cores

status Emmanuel Cecchet

FaRM: Fast Remote Memory

Advanced Computer Networks. End Host Optimization

Application of SDN: Load Balancing & Traffic Engineering

SR-IOV Support for Virtualization on InfiniBand Clusters: Early Experience

Optimizing MPI Communication on Multi-GPU Systems using CUDA Inter-Process Communication

Accelerating MPI Message Matching and Reduction Collectives For Multi-/Many-core Architectures Mohammadreza Bayatpour, Hari Subramoni, D. K.

Application Acceleration Beyond Flash Storage

A Plugin-based Approach to Exploit RDMA Benefits for Apache and Enterprise HDFS

Reducing Network Contention with Mixed Workloads on Modern Multicore Clusters

Exploiting RDMA operations for Providing Efficient Fine-Grained Resource Monitoring in Cluster-based Servers

The Exascale Architecture

High-Performance Broadcast for Streaming and Deep Learning

Designing and Enhancing the Sockets Direct Protocol (SDP) over iwarp and InfiniBand

Memory Management Strategies for Data Serving with RDMA

Can Parallel Replication Benefit Hadoop Distributed File System for High Performance Interconnects?

Lessons learned from MPI

Design and Performance Evaluation of a New Spatial Reuse FireWire Protocol. Master s thesis defense by Vijay Chandramohan

Introduction to Infiniband

Designing High-Performance MPI Collectives in MVAPICH2 for HPC and Deep Learning

Scaling to Petaflop. Ola Torudbakken Distinguished Engineer. Sun Microsystems, Inc

Exploiting InfiniBand and GPUDirect Technology for High Performance Collectives on GPU Clusters

Evaluating the Impact of RDMA on Storage I/O over InfiniBand

Multiprocessor Cache Coherence. Chapter 5. Memory System is Coherent If... From ILP to TLP. Enforcing Cache Coherence. Multiprocessor Types

Big Data Meets HPC: Exploiting HPC Technologies for Accelerating Big Data Processing and Management

Optimizing non-blocking Collective Operations for InfiniBand

ASPERA HIGH-SPEED TRANSFER. Moving the world s data at maximum speed

2017 Storage Developer Conference. Mellanox Technologies. All Rights Reserved.

DAFS Storage for High Performance Computing using MPI-I/O: Design and Experience

Accelerating Big Data with Hadoop (HDFS, MapReduce and HBase) and Memcached

What s New in VMware vsphere 4.1 Performance. VMware vsphere 4.1

Design and Evaluation of Benchmarks for Financial Applications using Advanced Message Queuing Protocol (AMQP) over InfiniBand

Infiniband Fast Interconnect

Unified Runtime for PGAS and MPI over OFED

Accelerating MPI Message Matching and Reduction Collectives For Multi-/Many-core Architectures

Designing Optimized MPI Broadcast and Allreduce for Many Integrated Core (MIC) InfiniBand Clusters

A Case for High Performance Computing with Virtual Machines

MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand

Future Routing Schemes in Petascale clusters

M7: Next Generation SPARC. Hotchips 26 August 12, Stephen Phillips Senior Director, SPARC Architecture Oracle

Performance Evaluation of InfiniBand with PCI Express

Flexible Architecture Research Machine (FARM)

Scaling Internet TV Content Delivery ALEX GUTARIN DIRECTOR OF ENGINEERING, NETFLIX

Enabling Efficient Use of UPC and OpenSHMEM PGAS models on GPU Clusters

Mark Falco Oracle Coherence Development

CS533 Concepts of Operating Systems. Jonathan Walpole

Intra-MIC MPI Communication using MVAPICH2: Early Experience

Deconstructing RDMA-enabled Distributed Transaction Processing: Hybrid is Better!

Birds of a Feather Presentation

Topic 6: SDN in practice: Microsoft's SWAN. Student: Miladinovic Djordje Date:

Performance Optimizations via Connect-IB and Dynamically Connected Transport Service for Maximum Performance on LS-DYNA

The Future of Interconnect Technology

Can High Performance Software DSM Systems Designed With InfiniBand Features Benefit from PCI-Express?

LUSTRE NETWORKING High-Performance Features and Flexible Support for a Wide Array of Networks White Paper November Abstract

Portland State University ECE 588/688. Graphics Processors

High Speed Asynchronous Data Transfers on the Cray XT3

EE382 Processor Design. Processor Issues for MP

Performance and Scalability with Griddable.io

Support for GPUs with GPUDirect RDMA in MVAPICH2 SC 13 NVIDIA Booth

OceanStor 9000 InfiniBand Technical White Paper. Issue V1.01 Date HUAWEI TECHNOLOGIES CO., LTD.

Designing High-Performance Non-Volatile Memory-aware RDMA Communication Protocols for Big Data Processing

Transcription:

Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services P. Balaji, K. Vaidyanathan, S. Narravula, H. W. Jin and D. K. Panda Network Based Computing Laboratory (NBCL) Computer Science and Engineering Ohio State University

Introduction and Motivation Interactive Data-driven Applications Scientific as well as Enterprise/Commercial Applications Static Datasets: Medical Imaging Modalities Dynamic Datasets: Stock value datasets, E-commerce, Sensors E-science Ability to interact with, synthesize and visualize large datasets Data-centers enable such capabilities Clients initiate queries (over the web) to process specific datasets Data-centers process data and reply to queries 04/26/06 D. K. Panda (The Ohio State University)

Typical Multi-Tier Data-center Environment Clients WAN WAN Proxy Server Web-server (Apache) More Computation and Communication Requirements Storage Application Server (PHP) Database Server (MySQL) Requests are received from clients over the WAN Proxy nodes perform caching, load balancing, resource monitoring, etc. If not cached, the request is forwarded to the next tiers Application Server Application server performs the business logic (CGI, Java servlets, etc.) Retrieves appropriate data from the database to process the requests

Limitations of Current Data-centers Communication Requirements TCP/IP used even in the data-center: Sub-optimal performance InfiniBand and other interconnects provide more features High Performance Sockets (e.g., SDP) Superior performance with no modifications Advanced Data-center Services Minimize the computation requirements Improved caching of documents Issues with caching Dynamic (or Active) Content Maximize compute resource utilization Efficient resource monitoring and management Issues with heterogeneous load characteristics of data-centers

Proposed Architecture Existing Data-Center Components Dynamic Content Caching Active Resource Adaptation Advanced System Services Soft Shared State Point To Point Distributed Lock Manager Global Memory Aggregator Data-Center Service Primitives Sockets Direct Protocol Packetized Flow-control Async. Zero-copy Communication Advanced Communication Protocols and Subsystems Protocol Offload RDMA Atomic Multicast Network

Presentation Layout Introduction and Motivation Advanced Communication Protocols and Subsystems Data-center Service Primitives Dynamic Content Caching Services Active Resource Adaptation Services Conclusions and Ongoing Work

The Sockets Protocol Stack Application App #1 App #2 App #N Sockets Interface Sockets Interface Traditional Sockets TCP Traditional Sockets TCP High Performance Sockets (e.g., SDP) IP Device Driver IP Device Driver Lower-level Interface High-speed Network Berkeley Sockets Implementation Advanced Features High-speed Network Offloaded Protocol The Sockets Protocol Stack allows applications to utilize the network performance and capabilities with NO or MINIMAL modifications

InfiniBand and Features An emerging open standard high performance interconnect High Performance Data Transfer Interprocessor communication and I/O Low latency (~1.0-3.0 microsec), High bandwidth (~10-20 Gbps) and low CPU utilization (5-10%) Flexibility for WAN communication Multiple Operations Send/Recv RDMA Read/Write Atomic Operations (very unique) high performance and scalable implementations of distributed locks, semaphores, collective communication operations Range of Network Features and QoS Mechanisms Service Levels (priorities) Virtual lanes Partitioning Multicast allows to design a new generation of scalable communication and I/O subsystem with QoS

SDP Latency and Bandwidth 70 Latency 60 900 Unidirectional Bandwidth 160 60 50 800 140 Latency (usec) 50 40 30 20 40 30 20 CPU Utilization % Bandwidth (Mpbs) 700 600 500 400 300 120 100 80 60 CPU Utilization % 10 10 200 100 40 20 0 2 4 8 16 32 64 128 256 512 1K 2K 4K 0 0 4 16 64 256 1K 4K 16K 64K 0 Message Size (Bytes) Message Size (Bytes) TCP/IP CPU TCP/IP Native IBA SDP CPU SDP TCP/IP CPU TCP/IP Native IBA SDP CPU SDP Sockets Direct Protocol over InfiniBand in Clusters: Is it Beneficial?, P. Balaji, S. Narravula, K. Vaidyanathan, K. Savitha, D. K. Panda. IEEE International Symposium on Performance Analysis and Systems (ISPASS), 04.

Zero-Copy Communication for Sockets Buffer 1 Application Blocks Buffer 2 Application Blocks Sender Send Send Complete Send Send Complete SRC AVAIL Get Data GET COMPLETE SRC AVAIL Get Data GET COMPLETE Receiver Buffer 1 Buffer 2

Asynchronous Zero-Copy SDP Buffer 1 Buffer 2 Sender Send Memory Protect Send Memory Protect SRC AVAIL Receiver Get Data Memory Unprotect Memory Unprotect GET COMPLETE Buffer 1 Buffer 2

Throughput and Comp./Comm. Overlap 12000 Throughput 10000 Comp./Comm. Overlap Throughput (Mbps) 10000 8000 6000 4000 BSDP ZSDP AZ-SDP Throughput (Mbps) 9000 8000 7000 6000 5000 4000 3000 BSDP ZSDP AZSDP 2000 2000 1000 0 1 4 16 64 256 1K 4K 16K Message Size (Bytes) 64K 256K 1M 0 0 20 40 60 80 100 Delay (usec) 120 140 160 180 200 Asynchronous Zero-copy Communication for Synchronous Sockets in the Sockets Direct Protocol (SDP) over InfiniBand. P. Balaji, S. Bhagvat, H. W. Jin and D. K. Panda. Workshop on Communication Architecture for Clusters (CAC); with IPDPS 06.

Presentation Layout Introduction and Motivation Advanced Communication Protocols and Subsystems Data-center Service Primitives Dynamic Content Caching Services Active Resource Adaptation Services Conclusions and Ongoing Work

Data-Center Service Primitives Common Services needed by Data-Centers Better resource management Higher performance provided to higher layers Service Primitives Soft Shared State Distributed Lock Management Global Memory Aggregator Network Based Designs RDMA, Remote Atomic Operations

Soft Shared State Data-Center Application Get Put Data-Center Application Data-Center Application Get Shared State Put Data-Center Application Get Put Data-Center Application Data-Center Application

Presentation Layout Introduction and Motivation Advanced Communication Protocols and Subsystems Data-center Service Primitives Dynamic Content Caching Services Active Resource Adaptation Services Conclusions and Ongoing Work

Dynamic data caching challenging! Cache Consistency and Coherence Become more important than in static case Active Caching Proxy Nodes Back-End Nodes User Requests Update

Active Cache Design Efficient mechanisms needed RDMA based design Load resiliency Our cooperation protocols No-Dependency Invalidate-All Client Polling based design

RDMA based Client Polling Design Front-End Back-End Request Response Cache Hit Version Read Response Cache Miss

Active Caching - Performance 16000 Data-Center Throughput 16000 Effect of Load Throughput 14000 12000 10000 8000 6000 No Cache Invalidate All Dependency Lists Throughput 14000 12000 10000 8000 6000 No Cache Dependency Lists 4000 4000 2000 2000 0 0 Trace 2 Trace 3 Trace 4 Trace 5 0 1 2 4 8 16 32 64 Traces with Increasing Update Rate Load (Compute Threads) Higher overall performance Up to an order of magnitude Performance is sustained under loaded conditions Architecture for Caching Responses with Multiple Dynamic Dependencies in Multi-Tier Data-Centers over InfiniBand. S. Narravula, P. Balaji, K. Vaidyanathan, H. -W. Jin and D. K. Panda. CCGrid-2005

Multi-tier Cooperative Caching RDMA based schemes Effective use of system-wide memory from across multiple tiers Significant performance benefits Our Schemes BCC, CCWR, MTACC and HYBCC Up to 2-3 times compared to the base case Improvement Ra Performance Improvement 3 2.5 2 1.5 1 0.5 0 BCC CCWR MTACC HYBCC 8k 16k 32k 64k S. Narravula, H. -W. Jin, K. Vaidyanathan and D. K. Panda, Designing Efficient Cooperative Caching Schemes for Multi-Tier Data-Centers over RDMA-enabled Networks. IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid 06).

Presentation Layout Introduction and Motivation Advanced Communication Protocols and Subsystems Data-center Service Primitives Dynamic Content Caching Services Active Resource Adaptation Services Conclusions and Ongoing Work

Active Resource Adaptation Increasing popularity of Shared data-centers How to decide the number of proxy nodes vs. application servers vs. database servers Current approach Use a rigid configuration Over-Provisioning Active Resource Adaptation Reconfigure nodes from one tier to another tier Allocate resources based on system load and traffic pattern Meet QoS and Prioritization constraints Load Resiliency

Active Resource Adaptation in Shared Data- Centers Load Balancing Cluster (Site A) Servers Website A (low priority) Clients Clients WAN Load Balancing Cluster (Site B) Hard QoS Maintained Servers Website B (medium priority) Load Balancing Cluster (Site C) Servers Website C (high priority) Reconf-PQ reconfigures nodes for different websites but also guarantees fixed number of nodes to low priority requests

Active Resource Adaptation Design Server Website A Load Balancer Server Website B Not Loaded Load Query RDMA RDMA Loaded Load Query Successful Atomic (Lock) Successful Atomic (Update Counter) Reconfigure Node Successful Atomic (Unlock) Load Shared Load Shared

Dynamic Reconfigurability using RDMA operations 60000 Throughput 100% QoS Meeting Capability TPS 50000 40000 30000 20000 10000 % of QoS Met 80% 60% 40% 20% 0 1K 2K 4K 8K 16K 0% Case 1 Case 2 Case 3 Rigid Reconf Over-Provisioning Reconf Reconf-P Reconf-PQ On the Provision of Prioritization and Soft QoS in Dynamically Reconfigurable Shared Data- Centers over InfiniBand. `P. Balaji, S. Narravula, K. Vaidyanathan, H. W. Jin and D. K. Panda. IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) 05.

Presentation Layout Introduction and Motivation Advanced Communication Protocols and Subsystems Data-center Service Primitives Dynamic Content Caching Services Active Resource Adaptation Services Conclusions and Ongoing Work

Conclusions Proposed a novel framework for data-centers to address the current limitations Low performance due to high communication overheads Lack of efficient support of advanced features such as active caching, dynamic resource adaptation, etc Three-layer Architecture Communication Protocol Support Data-Center Primitives Data-Center Services Novel approaches using the advanced features of InfiniBand Resilient to the load on the back-end servers Order of magnitude performance gain for several scenarios

Work-in-Progress Data-Center Primitives Efficient System-Wide Soft Shared State Mechanisms Efficient Distributed Lock Manager Mechanisms Fine-Grained Active Resource Adaptation Fine-grain resource monitoring Resource adaptation with database servers and multi-stage reconfigurations Detailed Data-Center Evaluation with the proposed framework

Web Pointers NBCL Website: http://www.cse.ohio-state.edu/~panda Group Homepage: http://nowlab.cse.ohio-state.edu Email: panda@cse.ohio-state.edu