Supporting Strong Cache Coherency for Active Caches in Multi-Tier Data-Centers over InfiniBand

Transcription:

Supporting Strong Cache Coherency for Active Caches in Multi-Tier Data-Centers over InfiniBand S. Narravula, P. Balaji, K. Vaidyanathan, S. Krishnamoorthy, J. Wu and D. K. Panda The Ohio State University

Presentation Outline: Introduction/Motivation, Design and Implementation, Experimental Results, Conclusions.

Introduction: Fast Internet growth in the number of users, the amount of data, and the types of services offered. Several uses: e-commerce, online banking, online auctions, etc. Types of content: images, documents, audio clips, video clips, etc. (static content); stock quotes, online stores (Amazon), online banking, etc. (dynamic, or active, content).

Presentation Outline: Introduction/Motivation (Multi-Tier Data-Centers, Active Caches, InfiniBand), Design and Implementation, Experimental Results, Conclusions.

Multi-Tier Data-Centers: From single powerful computers to clusters, which offer a low cost-to-performance ratio and are increasingly popular. For multi-tier data-centers built on such clusters, scalability is an important issue.

A Typical Multi-Tier Data-Center: [Diagram: clients connect over the WAN to Tier 0 (proxy nodes), Tier 1 (web servers such as Apache and application servers such as PHP), and Tier 2 (database servers).]

Tiers of a Typical Multi-Tier Data-Center: Proxy nodes handle caching, load balancing, security, etc. Web servers handle the HTML content. Application servers handle dynamic content and provide services. Database servers handle persistent storage.

Data-Center Characteristics: The amount of computation required for processing each request increases as we go from the front-end tiers to the inner (back-end) tiers of the data-center. Caching at the front tiers is therefore an important factor for scalability.

Presentation Outline: Introduction/Motivation (Introduction, Multi-Tier Data-Centers, Active Caches, InfiniBand), Design and Implementation, Experimental Results, Conclusions.

Caching: Can avoid re-fetching of content; beneficial if requests repeat. Static content caching is well studied and widely used. The number of requests decreases from the front-end tiers to the back-end tiers.

Active Caching: Caching of dynamic data such as stock quotes, scores, and personalized content. Simple caching methods are not suited. Issues: consistency and coherency. [Diagram: a user request served from the proxy node cache while the back-end data is updated.]

Cache Consistency: Non-decreasing views of system state; updates are seen by all or none. [Diagram: user requests arriving at the proxy nodes while an update is applied at the back-end nodes.]

Cache Coherency: Refers to the average staleness of the document served from cache. Two models of coherence: bounded staleness (weak coherency) and strong or immediate coherence (strong coherency).

Strong Cache Coherency: An absolute necessity for certain kinds of data: online shopping, travel ticket availability, stock quotes, online auctions. Example: online banking cannot afford to show different values to different concurrent requests.

Caching Policies: No caching, client polling, invalidation*, and TTL/adaptive TTL, compared with respect to the consistency and coherency they provide. (*D. Li, P. Cao, and M. Dahlin. WCIP: Web Cache Invalidation Protocol. IETF Internet Draft, November 2000.)

Presentation Outline: Introduction/Motivation (Introduction, Multi-Tier Data-Centers, Active Caches, InfiniBand), Design and Implementation, Experimental Results, Conclusions.

InfiniBand: High performance (low latency, high bandwidth) and an open industry standard. Provides rich features: RDMA, remote atomic operations, etc. Targeted for data-centers. Transport layers: VAPI, IPoIB, SDP.

Performance: [Plots: latency (us) and throughput (MB/s) versus message size for IPoIB, SDP, and VAPI.] Low latencies of less than 5 us are achieved; bandwidth is over 840 MB/s. (SDP and IPoIB results are with Voltaire's software stack.)

Performance: [Plot: RDMA Read throughput (Mbps) with polling- and event-based completion, and send/receive-side CPU utilization, versus message size (bytes).] Receiver-side CPU utilization is very low, leveraging the benefits of one-sided communication.

Caching Policies (revisited): No caching, client polling, invalidation, and TTL/adaptive TTL, compared with respect to consistency and coherency.

Objective: To design an architecture that very efficiently supports strong cache coherency on InfiniBand.

Presentation Outline: Introduction/Motivation, Design and Implementation, Experimental Results, Conclusions.

Basic Architecture: External modules are used, and module communication can use any transport. Versioning: application servers version the dynamic data; the version value of the data is passed to the front end with every request to the back-end; the front end maintains the version along with the cached value of the response.

Mechanism: On a cache hit, a back-end version check is performed; if the version is current, the cache is used, otherwise the cached data is invalidated. On a cache miss, the data is fetched into the cache and the local version is initialized. (A sketch of this flow follows.)
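As a concrete illustration of this hit/miss flow, here is a minimal C sketch of the proxy-side logic. The cache_entry layout and the cache_lookup, cache_invalidate, cache_insert, fetch_remote_version, and fetch_from_backend helpers are hypothetical names introduced for this sketch; they are not part of the original module.

```c
/* Minimal sketch of the proxy-side mechanism; helper functions are
 * assumed to be provided by the cache and transport layers. */
#include <stddef.h>
#include <stdint.h>

struct cache_entry {
    char    *data;      /* cached response */
    size_t   len;
    uint64_t version;   /* version recorded when the entry was cached */
};

/* Hypothetical helpers (declarations only). */
struct cache_entry *cache_lookup(const char *key);
void cache_invalidate(const char *key);
void cache_insert(const char *key, char *data, uint64_t version);
uint64_t fetch_remote_version(const char *key);           /* version check */
char *fetch_from_backend(const char *key, uint64_t *ver); /* full fetch   */

const char *serve_request(const char *key)
{
    struct cache_entry *e = cache_lookup(key);

    if (e != NULL) {
        /* Cache hit: verify the version against the back-end. */
        if (fetch_remote_version(key) == e->version)
            return e->data;      /* version current: use cached copy */
        cache_invalidate(key);   /* failed version check: drop entry */
    }

    /* Cache miss (or just-invalidated entry): get data to cache and
     * initialize the local version. */
    uint64_t version;
    char *data = fetch_from_backend(key, &version);
    cache_insert(key, data, version);
    return data;
}
```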

Architecture: [Diagram: a request at the front-end is served from the cache on a hit and forwarded to the back-end on a miss.]

Design: Every server has an associated module that uses IPoIB, SDP, or VAPI to communicate. VAPI: when a request arrives at the proxy, the VAPI module is contacted; the module reads the latest version of the data from the back-end using a one-sided RDMA Read operation. If the versions do not match, the cached value is invalidated.

VAPI Architecture: [Diagram: on a cache hit at the front-end, the module performs an RDMA Read to the back-end to check the version before sending the response; a cache miss is forwarded to the back-end.] A sketch of such a one-sided version read follows.
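The original module was written against the Mellanox VAPI interface; as a rough modern equivalent, the sketch below performs the same one-sided version read using the libibverbs API. It assumes an already-connected queue pair, a local version word inside a registered memory region, and the remote address and rkey of the back-end's version variable exchanged out of band; all of these are assumptions of the sketch, not details from the paper.

```c
/* One-sided version read, sketched with libibverbs (the paper used VAPI).
 * The queue pair is assumed to be connected, local_ver lies in a
 * registered MR with the given lkey, and remote_addr/rkey describe the
 * back-end's version word. */
#include <infiniband/verbs.h>
#include <stdint.h>

int rdma_read_version(struct ibv_qp *qp, struct ibv_cq *cq,
                      uint64_t *local_ver, uint32_t lkey,
                      uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_ver,
        .length = sizeof(*local_ver),
        .lkey   = lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id      = 1,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_READ,
        .send_flags = IBV_SEND_SIGNALED,
    };
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    struct ibv_send_wr *bad_wr = NULL;
    if (ibv_post_send(qp, &wr, &bad_wr))
        return -1;

    /* Poll for completion; the back-end CPU is not involved in the read. */
    struct ibv_wc wc;
    int n;
    do {
        n = ibv_poll_cq(cq, 1, &wc);
    } while (n == 0);
    if (n < 0 || wc.status != IBV_WC_SUCCESS)
        return -1;

    return 0;   /* *local_ver now holds the back-end's current version */
}
```

The proxy module then compares *local_ver with the version stored alongside the cached entry and invalidates the entry on a mismatch.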

Implementation: In the socket-based implementation, IPoIB and SDP are used, and the back-end version check is done using two-sided communication from the module. Requests to read and update are mutually excluded at the back-end module to avoid simultaneous readers and writers accessing the same data. Minimal changes to existing software are required. (A sketch of such a back-end module follows.)
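One way to realize the back-end side of this two-sided version check is sketched below: a per-document version table guarded by a reader/writer lock, with a tiny request/response exchange over the (IPoIB or SDP) socket. The table layout, wire format, and choice of a pthread rwlock are assumptions of the sketch, not details taken from the paper.

```c
/* Sketch of the back-end module for the socket-based (IPoIB/SDP) path.
 * The version table, wire format, and rwlock are illustrative choices. */
#include <pthread.h>
#include <stdint.h>
#include <sys/socket.h>

#define MAX_DOCS 1024

static uint64_t version_table[MAX_DOCS];
static pthread_rwlock_t table_lock = PTHREAD_RWLOCK_INITIALIZER;

/* Called when an application server updates a document. */
void bump_version(uint32_t doc_id)
{
    pthread_rwlock_wrlock(&table_lock);   /* writer excludes readers */
    version_table[doc_id]++;
    pthread_rwlock_unlock(&table_lock);
}

/* Serve one version query from the proxy module: the proxy sends a
 * 32-bit document id, the module replies with the 64-bit current version. */
int handle_version_query(int sock)
{
    uint32_t doc_id;
    if (recv(sock, &doc_id, sizeof(doc_id), MSG_WAITALL) != sizeof(doc_id))
        return -1;
    if (doc_id >= MAX_DOCS)
        return -1;

    pthread_rwlock_rdlock(&table_lock);   /* readers exclude the writer */
    uint64_t ver = version_table[doc_id];
    pthread_rwlock_unlock(&table_lock);

    if (send(sock, &ver, sizeof(ver), 0) != sizeof(ver))
        return -1;
    return 0;
}
```

With SDP, the same sockets code runs unmodified over InfiniBand, which is what keeps the changes to existing software minimal.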

Presentation Outline: Introduction/Motivation, Design and Implementation, Experimental Results (Data-Center Throughput, Data-Center Response Time, Data-Center Break-up, Zipf and WC Trace Throughput), Conclusions.

Experimental Test-bed: Eight nodes, each with dual 2.4 GHz Xeon processors, 64-bit 133 MHz PCI-X interfaces, 512 KB L2 cache, and a 400 MHz front-side bus; Mellanox InfiniHost MT23108 dual-port 4x HCAs; an MT43132 eight-port 4x switch; SDK version 0.2.0; firmware version 1.17.

Data-Center Performance: [Plot: data-center throughput in transactions per second (TPS) versus number of compute threads for No Cache, IPoIB, SDP, and VAPI.] The VAPI module can sustain performance even with heavy load on the back-end servers.

Data-Center Performance: [Plot: data-center response time (ms) versus number of compute threads for No Cache, IPoIB, SDP, and VAPI.] The VAPI module responds faster even with heavy load on the back-end servers.

Response Time Breakup: [Plots: response-time split-up (ms) for 0 and 200 compute threads, broken into client communication, proxy processing, module processing, and back-end version check, for IPoIB, SDP, and VAPI.] The worst-case module overhead is less than 10% of the response time, and the VAPI-based version check adds minimal overhead even with 200 compute threads.

Data-Center Throughput: [Plots: throughput in transactions per second (TPS) versus number of compute threads for a Zipf distribution and for the World Cup trace, for No Cache, IPoIB, SDP, and VAPI.] The drop in VAPI throughput on the World Cup trace is due to the higher penalty for cache misses under increased load; the VAPI implementation does better on the real trace as well.

Conclusions: An architecture for supporting strong cache coherence. The external-module-based design gives freedom in the choice of transport and requires minimal changes to existing software. The sockets API has an inherent limitation, its two-sided communication, so high-performance sockets (SDP) are not the solution. The main benefit comes from the one-sided nature of RDMA calls.

Web Pointers: NBC home page: http://nowlab.cis.ohio-state.edu/ E-mail: {narravul, balaji, vaidyana, savitha, wuj, panda}@cis.ohio-state.edu