A Sleep-based Communication Mechanism to ... Foundation for Research and Technology - Hellas (FORTH), Institute of Computer Science (ICS). May 1, 2011.


A Sleep-based Communication Mechanism. Akram, Foundation for Research and Technology - Hellas (FORTH), Institute of Computer Science (ICS). May 1, 2011.


Efficiency in Back-end Processing
Efficiency in back-end processing is important. Scalability is important, but the software stacks of individual nodes are becoming complex: runtime bloat (Nick Mitchell), complex messaging protocols, layers of software, libraries, etc. This leads to over-provisioning of resources for back-end processing.

Data streaming is recently gaining attention due to the large amounts of data to be processed and filtered. Queries are static and the data moves. The operators are similar to those of traditional databases. Reasons for adopting a distributed model: geographically distributed sources of data, and speed-up of application queries. Borealis (an academic consortium) and System S (IBM) are common examples.

Key Requirements
Scalability to many nodes. Provisioning for heavy inter-node communication. A rich library of stream operators; the protocol and the operators should be decoupled.

The Architecture of Borealis - Event Structure
Event-driven architecture, built on the notion of streams and tuples. [Diagram: a stream carries stream info (number of tuples, size of tuple, source of tuples) followed by the tuples themselves (Tuple1 ... Tuple4); each tuple holds a time stamp and data fields (Field1, Field2, Field3).]

The Architecture of Borealis - Threads and Data Structures
Four threads that work asynchronously: the receive thread, the process thread, the prepare thread, and the send thread, with data structures for inter-thread communication.

Subsystems in Middleware
Send/receive operations are implemented using either interrupts, which have high overhead at high network speeds and large message rates, or polling, which wastes CPU cycles at low network rates. The send/receive API provided by Linux sockets offers: blocking sockets (interrupts), non-blocking sockets (polling), and monitoring of multiple sockets (a blocking call to select). Monitoring multiple sockets with select has problems of its own.
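These socket options can be sketched with a local socket pair standing in for a network connection (an illustrative Python sketch, not the Borealis code): a non-blocking recv returns immediately when no data is available, so the caller must poll, while select blocks until one of the monitored sockets becomes readable.

```python
import select
import socket

# A connected socket pair stands in for two network peers.
a, b = socket.socketpair()

# Non-blocking socket: recv() with no data pending fails immediately,
# so the caller must retry in a loop -- polling that wastes CPU cycles
# when the network rate is low.
a.setblocking(False)
try:
    a.recv(1)
    got_data = True
except BlockingIOError:
    got_data = False

# select(): one blocking call monitors several sockets at once and
# wakes up as soon as any of them becomes readable.
b.send(b"event")
readable, _, _ = select.select([a], [], [], 1.0)
msg = readable[0].recv(16) if readable else None
print(got_data, msg)
```

With many sockets, select must scan the whole set on every call, which is one source of the problems the slide alludes to.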

Sleeping - An Alternative Approach
Sleep for a specific amount of time if no communication is expected. Regulating the sleeping time raises issues: kernel behavior, multiple applications sharing the machine, and the parameters of a single application changing over time. The granularity of the sleeping time may also change with a different kernel.
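As a minimal sketch of the idea (illustrative Python, with a hypothetical drain loop; the 10 ms interval matches the sleeping time used in the evaluation): when no message is waiting, the thread sleeps for a fixed interval instead of spinning, and work accumulates in the meantime.

```python
import time
from collections import deque

SLEEP_S = 0.010  # fixed sleeping time (10 ms, as in the evaluation)

def drain(queue, idle_limit=3):
    """Consume queued items; when the queue is empty, sleep for a fixed
    interval rather than spinning (polling) or taking per-message interrupts."""
    consumed, idle = [], 0
    while idle < idle_limit:
        if queue:
            consumed.append(queue.popleft())
            idle = 0
        else:
            time.sleep(SLEEP_S)  # yield the CPU; new work may arrive meanwhile
            idle += 1
    return consumed

print(drain(deque(["e1", "e2", "e3"])))
```

The hard part, as the slide notes, is choosing SLEEP_S: too short and the loop degenerates into polling, too long and latency suffers.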

Our Approach: Distribution/Accumulation of Work
The typical configuration of a data streaming system is a pipeline of senders and receivers, where the send and receive threads work asynchronously. The goal of the send thread is to ensure that the node downstream has enough work to perform. The goal of the receive thread is to unpack the events and hand work to the process thread, so that the layers above the communication protocol have enough work to do.

Sleeping in Waves
Both the send and receive threads maintain message queues. The receive thread informs the send thread of the availability of free slots in its queue by sending a credit message: after processing a few buffers, it sends a credit message to the send thread. The credit message allows the send thread to send data into buffers that the receive thread has already made available. If the send thread cannot find a credit message, it sleeps.

Sleeping in Waves
The receive thread unpacks the events, hands them to the event handler, and then checks for an event in the next slot of the queue. If the receive thread cannot find data in the buffer, it sleeps. While it is sleeping, the send thread fills up the queue with new events.

Sleeping in Waves: Summary
Sleeping criterion for the send thread: sleep for a fixed amount of time if no credits are available. Rationale: the receiver is busy unpacking messages and will send credits at some point. Sleeping criterion for the receive thread: sleep for a fixed amount of time if no new message is available. Rationale: all the available messages were unpacked and distributed to the layer above, and processing is much heavier than unpacking. In both cases the thread collects work while consuming no extra CPU cycles.
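The two sleeping criteria can be sketched together (an illustrative Python model, not the Borealis implementation; a semaphore stands in for the credit messages, granting one credit per free slot rather than batching credits every few buffers as Borealis does):

```python
import threading
import time
from collections import deque

SLEEP_S = 0.005                    # fixed sleeping time
queue = deque()                    # receive-side message queue
credits = threading.Semaphore(4)   # one credit per free receive-side slot
received = []

def send_thread(events):
    for e in events:
        # No credit: the receiver is busy unpacking and will send
        # credits at some point, so sleep instead of spinning.
        while not credits.acquire(blocking=False):
            time.sleep(SLEEP_S)
        queue.append(e)

def receive_thread(total):
    while len(received) < total:
        if queue:
            received.append(queue.popleft())  # unpack, hand to layer above
            credits.release()                 # "credit": a slot is free again
        else:
            # No new message: sleep while the sender fills the queue.
            time.sleep(SLEEP_S)

events = list(range(32))
s = threading.Thread(target=send_thread, args=(events,))
r = threading.Thread(target=receive_thread, args=(len(events),))
s.start(); r.start(); s.join(); r.join()
print(received == events)
```

Each side makes progress in waves: the sender bursts until its credits run out, the receiver drains until its queue empties, and whichever side has nothing to do sleeps.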

Machine Parameters and Benchmark for Evaluation
Four server-type systems running Linux CentOS release 5.4, each with two Intel Xeon quad-core CPUs (2-way hyper-threaded), 14 GBytes of DRAM, and a 10 Gbit/s Ethernet NIC from Myrinet, connected by a 10 Gbit/s Ethernet HP ProCurve 3400cl switch. The workload is a custom benchmark that filters the incoming data (the filter condition is always true, so as to load the network). The first node generates the tuples, the next two process them, and the last node receives the tuples and consumes them internally.

Some Parameters of Borealis
Number of Borealis instances (8). Batching factor (varying). Tuple size (varying). Size of the send-side queue (10). Size of the receive-side queue (100). Frequency of exchanging credits (every 10 buffers). Sleeping time of 10 ms.

Myrinet MX - A User-level Networking API
Provides a user-level networking API. Baseline throughput is higher: it removes one copy on the send side, removes two copies on the receive path, and reduces the number of interrupts on the receive side. It offers fine-grained control for managing buffers and eases the implementation of flow-control mechanisms.

Configurations of Borealis for Evaluation
tcp: baseline version of Borealis with TCP/IP. mx-poll: Borealis with the Myrinet MX protocol, using polling operations for testing buffers. mx-int: Borealis with the Myrinet MX protocol, using interrupts (blocking operations) for testing buffers. mx-sleep: Borealis with the Myrinet MX protocol, using the sleep system call.

Baseline Throughput of Borealis with TCP and Myrinet MX
[Plots: throughput in ktuples/sec vs. event size (128 bytes to 16K) for tcp, mx-poll, mx-int, and mx-sleep; panel (a) 128-byte tuples, panel (b) 512-byte tuples.] mx-int improves the throughput of Borealis over tcp (22%). mx-poll has lower throughput than mx-int. mx-adp gives better throughput than tcp (23-63%).

Throughput and CPU Utilization - All Configurations
[Plots: throughput in ktuples/sec (panels (c)-(e)) and CPU utilization in % (panels (f)-(h)) vs. event size (128 bytes to 16K), for tuple sizes of 128, 256, and 512 bytes; configurations mx-poll, mx-int, and mx-sleep.]

General Trends in Writing Middleware
Modules are written by different developers. Middleware accounts for heterogeneous architectures and for slow networks, and over-provisions for memory.

General Trends in Writing Middleware
Buffer management across threads/modules: passing a (buffer ptr, size) pair, or copying a buffer and passing the copy. Serialization/deserialization for heterogeneity and portability: exchanging data among heterogeneous nodes, and packing data structures spread across different parts of memory. This incurs the overhead of copies, or of a separate send operation for each field of a data structure.
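The serialization point can be illustrated with a small sketch (Python, with an assumed header layout, not the Borealis wire format): fields scattered across memory are packed into one contiguous buffer, so a single send suffices instead of one send, and one copy, per field.

```python
import struct

# Assumed event header layout: tuple id, time stamp, payload length.
HEADER = struct.Struct("!IdI")

def serialize(tuple_id, timestamp, payload):
    # Pack scattered fields into one contiguous buffer: one copy and
    # one send operation, instead of a separate send per field.
    return HEADER.pack(tuple_id, timestamp, len(payload)) + payload

def deserialize(buf):
    tuple_id, ts, n = HEADER.unpack_from(buf)
    return tuple_id, ts, buf[HEADER.size:HEADER.size + n]

wire = serialize(7, 1304208000.0, b"field1|field2|field3")
print(deserialize(wire))
```

The network byte order ("!") in the format string is what buys portability across heterogeneous nodes.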

General Trends in Writing Middleware
Message queuing for asynchronous operation: threads might block on slow networks, so buffering provides asynchronous operation; on fast networks it is not necessary, and the event could be sent directly from the prepare thread, blocking if need be. Flow control: memory is usually over-provisioned, with virtual memory backed by swap space on disk. Proper flow control involves accounting for the memory in use by the different threads; proper inter-thread flow control saves memory resources for other tasks in the system. Different structures could possibly implement flow control (which one to choose?).

Observations from the Borealis Flow
[Diagram: the event path through Borealis - (1) the receive thread reads into a global buffer (rbuf) via its stack space, (2, 3a, 3b) builds a new event and hands it to the process thread, (4) which passes it to the prepare thread, (5a, 5b, 6) which serializes the event into a string placed in a global buffer (wbuf), and (7) the send thread transmits it.]

Sleep-based communication policies can save CPU cycles for other tasks. The main problem is finding criteria for when to sleep, and portability is a concern. Saving CPU cycles for a given application means less power; giving those CPU cycles to another application improves the overall energy efficiency of the system. Is there too much focus on scaling?
