Designing Next-Generation Data- Centers with Advanced Communication Protocols and Systems Services. Presented by: Jitong Chen

Similar documents
Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

High Performance Distributed Lock Management Services using Network-based Remote Atomic Operations

Advanced RDMA-based Admission Control for Modern Data-Centers

S. Narravula, P. Balaji, K. Vaidyanathan, H.-W. Jin and D. K. Panda. The Ohio State University

Supporting Strong Cache Coherency for Active Caches in Multi-Tier Data-Centers over InfiniBand

Optimized Distributed Data Sharing Substrate in Multi-Core Commodity Clusters: A Comprehensive Study with Applications

FaRM: Fast Remote Memory

Designing Efficient Systems Services and Primitives for Next-Generation Data-Centers

High Performance Distributed Lock Management Services using Network-based Remote Atomic Operations

Multiprocessor Support

Dynamic Reconfigurability Support for providing Soft QoS Guarantees in Cluster-based Multi-Tier Data-Centers over InfiniBand

Designing High Performance DSM Systems using InfiniBand Features

No compromises: distributed transactions with consistency, availability, and performance

Improving Scalability of Processor Utilization on Heavily-Loaded Servers with Real-Time Scheduling

IsoStack Highly Efficient Network Processing on Dedicated Cores

Engineering Robust Server Software

EE382 Processor Design. Processor Issues for MP

FaSST: Fast, Scalable, and Simple Distributed Transactions with Two-Sided (RDMA) Datagram RPCs

6.9. Communicating to the Outside World: Cluster Networking

Curriculum 2013 Knowledge Units Pertaining to PDC

Memory-Based Cloud Architectures

1. Memory technology & Hierarchy

WORKLOAD CHARACTERIZATION OF INTERACTIVE CLOUD SERVICES BIG AND SMALL SERVER PLATFORMS

SMP/SMT. Daniel Potts. University of New South Wales, Sydney, Australia and National ICT Australia. Home Index p.

Computer-System Organization (cont.)

Distributed File Systems Issues. NFS (Network File System) AFS: Namespace. The Andrew File System (AFS) Operating Systems 11/19/2012 CSC 256/456 1

Current Topics in OS Research. So, what s hot?

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2014/15

Multitasking and Multithreading on a Multiprocessor With Virtual Shared Memory. Created By:- Name: Yasein Abdualam Maafa. Reg.

High-Performance Key-Value Store on OpenSHMEM

Efficient and Truly Passive MPI-3 RMA Synchronization Using InfiniBand Atomics

Mark Falco Oracle Coherence Development

Big and Fast. Anti-Caching in OLTP Systems. Justin DeBrabant

Parallel Computer Architecture Spring Distributed Shared Memory Architectures & Directory-Based Memory Coherence

<Insert Picture Here> Boost Linux Performance with Enhancements from Oracle

2017 Storage Developer Conference. Mellanox Technologies. All Rights Reserved.

BzTree: A High-Performance Latch-free Range Index for Non-Volatile Memory

CS533 Concepts of Operating Systems. Jonathan Walpole

MATRIX:DJLSYS EXPLORING RESOURCE ALLOCATION TECHNIQUES FOR DISTRIBUTED JOB LAUNCH UNDER HIGH SYSTEM UTILIZATION

arxiv: v2 [cs.db] 21 Nov 2016

740: Computer Architecture Memory Consistency. Prof. Onur Mutlu Carnegie Mellon University

Designing Optimized MPI Broadcast and Allreduce for Many Integrated Core (MIC) InfiniBand Clusters

RDMA Read Based Rendezvous Protocol for MPI over InfiniBand: Design Alternatives and Benefits

Evaluating the Impact of RDMA on Storage I/O over InfiniBand

Hybrid MPI - A Case Study on the Xeon Phi Platform

Benefits of Dedicating Resource Sharing Services in Data-Centers for Emerging Multi-Core Systems

The influence of system calls and interrupts on the performances of a PC cluster using a Remote DMA communication primitive

The End of a Myth: Distributed Transactions Can Scale

Review: Creating a Parallel Program. Programming for Performance

COREMU: a Portable and Scalable Parallel Full-system Emulator

Synchronization. Coherency protocols guarantee that a reading processor (thread) sees the most current update to shared data.

Chapter 4: Distributed Systems: Replication and Consistency. Fall 2013 Jussi Kangasharju

Chapter 5 Thread-Level Parallelism. Abdullah Muzahid

Blackfin Optimizations for Performance and Power Consumption

I, J A[I][J] / /4 8000/ I, J A(J, I) Chapter 5 Solutions S-3.

Dynamic Partitioned Global Address Spaces for Power Efficient DRAM Virtualization

Multifunction Networking Adapters

Parallel Computing Platforms. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University

Commercial Real-time Operating Systems An Introduction. Swaminathan Sivasubramanian Dependable Computing & Networking Laboratory

The Google File System

Technical Computing Suite supporting the hybrid system

A Distributed Hash Table for Shared Memory

Lecture 18: Coherence and Synchronization. Topics: directory-based coherence protocols, synchronization primitives (Sections

Portland State University ECE 588/688. Graphics Processors

Lecture 18: Coherence Protocols. Topics: coherence protocols for symmetric and distributed shared-memory multiprocessors (Sections

A Global Operating System for HPC Clusters

Chapter 5. Thread-Level Parallelism

Improving I/O Bandwidth With Cray DVS Client-Side Caching

Deconstructing RDMA-enabled Distributed Transaction Processing: Hybrid is Better!

Parallel Computing Platforms

Multiprocessor Systems. Chapter 8, 8.1

Best Practices for Setting BIOS Parameters for Performance

MultiJav: A Distributed Shared Memory System Based on Multiple Java Virtual Machines. MultiJav: Introduction

Using RDMA for Lock Management

Birds of a Feather Presentation

Multiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering

OceanStor 9000 InfiniBand Technical White Paper. Issue V1.01 Date HUAWEI TECHNOLOGIES CO., LTD.

CSE502: Computer Architecture CSE 502: Computer Architecture

Messaging Overview. Introduction. Gen-Z Messaging

Multiprocessor Cache Coherence. Chapter 5. Memory System is Coherent If... From ILP to TLP. Enforcing Cache Coherence. Multiprocessor Types

Memory Management Strategies for Data Serving with RDMA

Network-Aware Resource Allocation in Distributed Clouds

Handout 3 Multiprocessor and thread level parallelism

CMSC Computer Architecture Lecture 15: Memory Consistency and Synchronization. Prof. Yanjing Li University of Chicago

Replacement Policy: Which block to replace from the set?

Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution

EXTENDING AN ASYNCHRONOUS MESSAGING LIBRARY USING AN RDMA-ENABLED INTERCONNECT. Konstantinos Alexopoulos ECE NTUA CSLab

Page 1. Cache Coherence

Systems Architecture II

Research on the Implementation of MPI on Multicore Architectures

Concurrent Preliminaries

Advanced Computer Networks. End Host Optimization

Computer Organization and Structure. Bing-Yu Chen National Taiwan University

DAFS Storage for High Performance Computing using MPI-I/O: Design and Experience

Chelsio Communications. Meeting Today s Datacenter Challenges. Produced by Tabor Custom Publishing in conjunction with: CUSTOM PUBLISHING

Introduction to OpenOnload Building Application Transparency and Protocol Conformance into Application Acceleration Middleware

Tailwind: Fast and Atomic RDMA-based Replication. Yacine Taleb, Ryan Stutsman, Gabriel Antoniu, Toni Cortes

CSE 120 Principles of Operating Systems

Can Memory-Less Network Adapters Benefit Next-Generation InfiniBand Systems?

Network Interface Architecture and Prototyping for Chip and Cluster Multiprocessors

Transcription:

Designing Next-Generation Data- Centers with Advanced Communication Protocols and Systems Services Presented by: Jitong Chen

Outline Architecture of Web-based Data Center Three-Stage framework to benefit from InfiniBand Optimize Communication Protocol Data-Center Service Primitives Dynamic Content Caching Active Resource Adaptation

Architecture of Web-based Data Center

Problems of Traditional Web-based Data Center TCP/IP Protocols have high latency, low bandwidth Two-sided communication incur CPU overhead at two sides. Low scalability of Strong Cache Coherence for Dynamic Content Caching Poor Service-Level Load-balancing Support to fully utilize limited physical Resource

Three-Stage Framework to Benefit from InfiniBand

Optimize Communication Protocol AZ_SDP (Asynchronous Zero-Copy SDP)

Data-Center Service Primitives Soft shared state primitive efficiently share information across cluster by creating a logical shared memory region using IBA s RDMA operation

Data-Center Service Primitives Soft shared state primitive efficiently share information across cluster by creating a logical shared memory region using IBA s RDMA operation

Dynamic Content Caching Client Polling Protocol Using RDMA read Coherent Invalidation The New Caching Design achieve 20% improvement for over-all data center throughput

Active Resource Adaptation

Active Resource Adaptation 8 nodes 14 nodes

Summary Proposed a three-layer framework AZSDP reduce communication overhead Soft State Primitives eases the sharing of information across cluster RDMA-based Dynamic Content Caching increase throughput RDMA-based Active Resource Adaptation Protocol

DDDS: A Low-Overhead Distributed Data Sharing Substrate for Cluster- Based Data-Centers over Modern Interconnects Presented by: Jitong Chen

Outline The Design Goals of DDSS DDSS Framework Implementation Evaluation

The Design Goals of DDSS Allow efficient sharing of information across the cluster by creating a logical shared memory region Support local and remote allocation in the shared state Support the access, update and deletion of data for all threads in a transparent manner be resilient to load imbalances and should have minimal overheads to access to data

The Design Goals of DDSS Support a range of coherency models: Strict Coherence (obtain the most current version and excludes concurrent writes and reads) Write Coherence (obtain the most current version and excludes concurrent writes) Read Coherence (obtain the most current version and excludes concurrent reads) No Coherence Delta Coherence (data is no more than x versions stale) Temporal Coherence (data is no more than t time units stale)

Non-Coherent/Coherent Distributed Data Sharing

DDSS Framework

Implementation IPC: create a run-time daemon support user process or thread to access DDSS Data Placement: try to distribute allocations among different nodes to avoid NIC contention Data Access: use one-sided operations to access remote memory without interrupting the remote node

Implementation Locking Mechanism: use atomic operation Compare-and-Swap to acquire and check the status of locks Coherence Maintenance: use atomic operation Fetch-and-Add to update the version of every put() operation,

Implementation DDSS Interface:

Evaluation Micro benchmark Increasing clients accessing different portions from a single node using get()

Evaluation Dynamic reconfiguration

Evaluation Application-level evaluation

Supporting Strong Coherency for Active Caches in Multi-Tier Data-Centers over InfiniBand Presented by: Jitong Chen

Outline Architecture of Multi-Tier Data Center Web Cache Coherence Strong Cache Coherency Model Strong Cache Coherency Model over InfiniBand Experiment Results

Architecture of Multi-Tier Data Center

Web Cache Coherence Average staleness of the documents present in the cache, i.e., the time elapses between the current time and the time of the last update of the document in the back-end. Strong Coherence means average staleness is zero. i.e., a client get the same response whether a request is answered from cache or from the back-end.

Strong Cache Coherency Model

Strong Cache Coherency Model

Strong Cache Coherency Model over InfiniBand

Experiment Results

Experiment Results

Summary RDMA operations provide low latency and high bandwidth communication between tiers in data center One sides communication provided by native InfiniBand leave more CPU free for the data center nodes to perform other operations. When Application Server is busy, one sided communication doesn t require much CPU to request coherence status from back-end tier, therefore cache verification is not slowed down too much even the application server is heavily loaded.

Thank You!