Rsyslog: going up from 40K messages per second to 250K. Rainer Gerhards


What's in it for you? Bad news: this talk will not teach you how to make your kernel component five times faster. The perspective is that of a user-space application going from single core to multi-core. Lessons learned: things we got wrong, how we improved the situation, our experiences; hopefully useful for other user-space applications as well.

So what is rsyslog? A modern syslog message processor, forked from sysklogd. Some initial coding started in 2003: a single-threaded design and pretty old code, but it worked! It really got momentum when Fedora looked for a new syslogd in 2007, and it has since become the de-facto standard on most distributions.

Rsyslog project... Design goals around 2004: a drop-in replacement for sysklogd, easy to use for simple cases, powerful for complex cases, with high performance and support for tomorrow's multi-core machines. Very heavy hacking in 2007 and 2008: many, many features added, with no time to consolidate them.

How does rsyslog relate to other apps? Rsyslog actually is a message router, processing mostly independent, somewhat similar objects within a kind of pipeline. This makes rsyslog, and its problems, similar to many other (server) applications.

Performance Optimization Project. Rsyslog is deployed in highly demanding data centers; early v4 could handle 40K mps but scaled very badly on multiple cores. Project goals: speed up processing of a single message and improve scalability. Phase one, winter/spring 2009, is the focus of this talk; it resulted in up to 250K mps as reported by some users.

Classes of Optimizations Traditional optimizations Refactoring Memory-subsystem based optimizations Concurrency-related optimizations

Traditional Optimizations. The boring stuff, still useful to look at... C strings vs. counted strings; operating system calls (beware of context switches!); buffer sizes; more specific algorithms: don't let seldom-used features constrain often-used ones, and use specific (fast) code for common cases.
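The counted-string idea from the list above can be sketched as follows. This is a minimal illustration, not rsyslog's actual string type: storing the length next to the buffer makes size queries O(1), while a plain C string forces an O(n) strlen() scan every time.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical counted-string type: length is stored alongside the
 * buffer, so size queries are O(1) instead of an O(n) strlen() scan. */
typedef struct {
    char   *buf;
    size_t  len;
} cstr_t;

static cstr_t cstr_from(const char *s)
{
    cstr_t cs;
    cs.len = strlen(s);            /* scanned once, at construction */
    cs.buf = malloc(cs.len + 1);
    memcpy(cs.buf, s, cs.len + 1); /* copy including NUL terminator */
    return cs;
}

/* O(1): no rescan of the buffer, unlike strlen() on a C string. */
static size_t cstr_len(const cstr_t *cs)
{
    return cs->len;
}
```

The payoff shows up wherever lengths are needed repeatedly, e.g. when the same message property is appended to several output buffers.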

Code Refactoring. Time to deliver was initially dominant and there were few external reviewers. Our own review found lots to change, e.g. unnecessary parameter formatting due to interface changes, and unnecessarily deep function nesting due to functionality being shuffled between functions.

Refactoring: Design Review. E.g. the "no worker" really dumb case... Worker pool management was very complex. The core design failure: we thought it would be useful to stop all workers when no work was to be done. Of course, that was wrong: keeping one worker blocked doesn't require many resources, but restarting one does! We removed that capability and got faster, easier-to-maintain code with less bug potential. More potential for this kind of refactoring remains, e.g. the (over-engineered) network driver layer.

Memory Subsystem: old ideas. Access to memory is often considered equally fast to all memory locations, for both reads and writes; this builds the basis for (almost?) all academic reasoning on algorithm performance. It is often assumed that aligned memory access is always faster than unaligned access.

Memory Subsystem: today's reality. Access time differs greatly depending on which memory location is accessed and when. Writes are much slower than reads. Unaligned access may be faster for some uses.

Memory: important concepts. Locality: spatial and temporal. Working set: the minimum amount of memory needed to carry out a closely related set of activities; for rsyslog, the memory needed to receive, filter and output a message. The goal is to achieve spatial and temporal locality for the working set!

Memory: malloc subsystem. Try to reduce the number of malloc calls: malloc instead of calloc; use the stack instead of the heap where possible (but this makes memory debugging much more difficult); (somewhat) larger mallocs are OK. Fixed buffers instead of malloc: use a common size for a fixed allocation inside the structure, and malloc only if the actual size is larger. Great for small elements (< 8-byte pointer size!).
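The "fixed buffer inside the structure, malloc only if larger" pattern described above can be sketched like this. The names and the 32-byte threshold are illustrative assumptions, not rsyslog's actual code: small values take the inline fast path with no heap allocation at all, and only oversized values fall back to malloc.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define INLINE_SIZE 32  /* chosen to cover the common case; an assumption */

typedef struct {
    char   inline_buf[INLINE_SIZE];
    char  *heap_buf;   /* NULL while the inline buffer suffices */
    size_t len;
} prop_t;

static void prop_set(prop_t *p, const char *val, size_t len)
{
    p->len = len;
    if (len < INLINE_SIZE) {
        memcpy(p->inline_buf, val, len);   /* fast path: no malloc */
        p->inline_buf[len] = '\0';
        p->heap_buf = NULL;
    } else {
        p->heap_buf = malloc(len + 1);     /* slow path: large value */
        memcpy(p->heap_buf, val, len);
        p->heap_buf[len] = '\0';
    }
}

static const char *prop_get(const prop_t *p)
{
    return p->heap_buf ? p->heap_buf : p->inline_buf;
}
```

Besides skipping the allocator, the inline buffer keeps the value in the same cache lines as the rest of the structure, which matches the locality goal of the previous slide.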

Memory: keep related things together. Fixed buffers (as shown on the last slide); structure packing; use bit fields where appropriate (but only then). But move unrelated things away from each other when they are written to by different threads (counters!), otherwise cache thrashing may severely affect performance.
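The counter warning above is about false sharing: two per-thread counters that land in the same cache line force the line to bounce between cores on every write. A minimal sketch of the usual fix, padding each counter to its own cache line (64 bytes is typical on x86; that figure and the struct layout are assumptions, not rsyslog's code):

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 64  /* typical x86 cache-line size; an assumption */

/* One counter slot per worker thread, padded so that writes from
 * different threads never touch the same cache line. */
struct worker_stats {
    unsigned long processed;
    char pad[CACHE_LINE - sizeof(unsigned long)];  /* fill out the line */
};

static struct worker_stats stats[4];  /* one slot per worker thread */

static void count_for_worker(int id)
{
    stats[id].processed++;  /* no cross-thread cache-line bouncing */
}
```

Without the padding, four adjacent `unsigned long` counters would share one line and every increment would invalidate it in the other cores' caches.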

Memory: reuse memory regions. We improved buffer management to make it most likely that a memory region is continuously accessed by the same thread. Property objects keep their value for a relatively long time (many messages); they are allocated and written once, read (very) often, and reference counted.

Concurrency. Paradigm shift: software must exploit concurrency directly, as a single core does not get much faster. Rsyslog started deploying multi-threading very early, with some (dumb, again ;-) ) mistakes made.

Problem seen in Practice. Lock contention limited performance, and performance even decreased when adding additional threads with fast output processing: lock contention increased dramatically, and locks then needed to go to kernel space, which became the dominating performance factor.

Rsyslog Design (rough sketch). Concurrency exists in each input, in the queue workers, and (potentially) in the output modules.

Classical User Perception of syslog: sequential; it assumes that the sequence of messages in the log store equals the sequence of events.

Root Cause: the usual assumptions are invalid! Storage sequence does not reflect event sequence, due to buffering when the target system is unavailable, interim systems (including network reordering), multithreading on any sender or receiver, and scheduling order. In short: sequence can only be preserved in a totally sequential system, which we do not have (and do not want!).

So, what's the solution to sequence? Use a kind of timestamp / order relation: high-precision timestamps inside messages, timestamps with sequence numbers, or Lamport clocks (no implementation so far). Then process logs according to the selected order relation. Bottom line: sequence does not need to be preserved at the syslogd level, because syslogd cannot do so anyway!
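The "timestamp plus sequence number" order relation above can be sketched as a comparator: each message carries a high-precision receive timestamp and a sequence number to break ties, so log consumers can sort at read time instead of forcing the daemon to be sequential. Field and function names here are illustrative, not rsyslog's.

```c
#include <assert.h>
#include <time.h>

struct msg_id {
    struct timespec ts;   /* high-precision receive time */
    unsigned long   seq;  /* tie-breaker for identical timestamps */
};

/* Returns <0, 0, >0 like strcmp: the total order used when reading logs. */
static int msg_cmp(const struct msg_id *a, const struct msg_id *b)
{
    if (a->ts.tv_sec  != b->ts.tv_sec)
        return a->ts.tv_sec  < b->ts.tv_sec  ? -1 : 1;
    if (a->ts.tv_nsec != b->ts.tv_nsec)
        return a->ts.tv_nsec < b->ts.tv_nsec ? -1 : 1;
    if (a->seq != b->seq)
        return a->seq < b->seq ? -1 : 1;
    return 0;
}
```

A log-analysis tool would sort (or merge) entries with this comparator, recovering event order without any serialization inside the daemon.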

How this affects rsyslog... This is the single most important fact with respect to rsyslog design and performance. Rsyslog's initial design tried to preserve message order as much as possible, which severely blocked partitioning of the workload. The performance optimization gained great benefits from this insight: now almost everything can be done highly concurrently! It is (most) often invisible to the user, and users who don't like it can turn it off.

Workload Partitioning. Process messages in batches instead of individually: this reduces the number of mutex calls dramatically, reduces lock contention even more (contention becomes less likely), and has positive side effects on other items as well; it is a kind of temporal partitioning. Multiple main message queues: inputs can submit messages to defined queues, and totally independent queues mean no locking contention at all between queues.
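The batching idea can be sketched as a batch dequeue: a worker takes the queue mutex once per batch instead of once per message, cutting mutex traffic roughly by the batch size. The queue layout and names below are illustrative assumptions, not rsyslog's actual queue implementation.

```c
#include <assert.h>
#include <pthread.h>

#define QUEUE_CAP 1024
#define BATCH_MAX 64

typedef struct {
    void           *msgs[QUEUE_CAP];  /* ring buffer of message pointers */
    int             head, count;
    pthread_mutex_t mut;
} queue_t;

/* Move up to BATCH_MAX entries out under a single lock acquisition;
 * the caller then processes the whole batch without holding the lock. */
static int dequeue_batch(queue_t *q, void *batch[BATCH_MAX])
{
    pthread_mutex_lock(&q->mut);
    int n = q->count < BATCH_MAX ? q->count : BATCH_MAX;
    for (int i = 0; i < n; ++i)
        batch[i] = q->msgs[(q->head + i) % QUEUE_CAP];
    q->head  = (q->head + n) % QUEUE_CAP;
    q->count -= n;
    pthread_mutex_unlock(&q->mut);
    return n;
}
```

With one lock/unlock pair per batch of up to 64 messages rather than per message, contention on the queue mutex drops sharply, which is exactly the effect the slide describes.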

Locking Improvements. Simplified locking primitives: removed the need for recursive mutexes, evaluated the code and selected the fastest locking method that did the job. Atomic operations replaced locking for simple cases (counters); they will become more important when lock-/wait-freedom is addressed in the third tuning effort, winter 2010/11.
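Replacing a mutex-guarded counter with an atomic, as mentioned above, can be sketched with C11 atomics: each increment becomes a single lock-free read-modify-write instead of a lock/unlock pair. (rsyslog predates C11 and used compiler builtins for this; stdatomic.h is the modern equivalent, and the names here are illustrative.)

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_ulong msgs_processed;  /* shared counter, no mutex needed */

/* Relaxed ordering suffices for a pure statistics counter: we only
 * need the count to be correct, not to order other memory accesses. */
static void count_message(void)
{
    atomic_fetch_add_explicit(&msgs_processed, 1, memory_order_relaxed);
}

static unsigned long messages_so_far(void)
{
    return atomic_load_explicit(&msgs_processed, memory_order_relaxed);
}
```

On common hardware this compiles to one atomic add instruction, avoiding both the lock/unlock overhead and the kernel-space transitions that contended mutexes can incur.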

Some other Things. Moved functionality to different pipeline stages, utilizing different levels of concurrency; example: message parsing moved from the input stage to the main queue worker thread. Reduce hidden locks: some subsystems guard operations by locking, so calling them serializes processing; examples: the malloc subsystem, and other libraries as well.

Architecture after Redesign

Are we done now? No, definitely not: rsyslog still scales far from linearly for a large number of cores. A second tuning effort, done in spring 2010, brought another 4x speedup; it focused on common use cases and a first exploration of lock-free algorithms. A third effort is planned for winter/spring 2010/11; its primary focus will be lock-freedom, and it will hopefully come close to near-linear speedup.

Conclusion. We often needed to look at a very fine-grained level to achieve high-level improvements. We did some of the usual stuff, refactored some anomalies of a fast-growing project, and took a close look at modern hardware, but most importantly we needed to break with traditional perceptions.

Most important lesson learned: re-evaluating current practice and questioning old habits is probably a key ingredient of moving from the mostly sequential programming paradigm to the fully concurrent one demanded by current and future hardware.

Many thanks for your attention Questions? rgerhards@hq.adiscon.com