PostgreSQL Entangled in Locks: Attempts to free it

Transcription:

PostgreSQL Entangled in Locks: Attempts to free it. PGCon 2017, 26.05.2017. Amit Kapila, Dilip Kumar.

Overview
- Background: effects of locking on scalability, past approaches
- Existing issues in lock contention
- Read-only bottlenecks: C-Hash for the buffer mapping lock, snapshot caching for ProcArrayLock
- Read-write bottlenecks: WAL write lock, Clog control lock

Effects of locking
Locks have been a major scalability issue in PostgreSQL, and a lot of work has already been done to remove these bottlenecks:
- Fast-path relation locks
- Changed locking regimen around buffer replacement
- Lockless StrategyGetBuffer in the clock sweep
- ProcArrayLock contention reduced by group-clearing XIDs at commit
- Buffer header spinlock converted to atomic operations
- Reduced hash header lock contention

Empirical evidence for read-only bottlenecks (1/3)
Performance on an Intel 8-socket machine (128 cores), measured at commit f58b664393dcfd02c2f57b3ff20fc0aee6dfebf1.
- If the data doesn't fit in shared buffers, performance falls off after 64 clients.
- When the data fits in shared buffers, it falls off after 128 clients.

Empirical evidence for read-only bottlenecks (2/3)
Experimental setup: wait event test
- Workload: pgbench read-only
- Hardware: Intel 8-socket machine (128 cores with HT)
- Scale factor: 5000
- Run duration: 120 s

  Wait Event Count   Wait Event Type   Wait Event Name
  39919              LWLock            buffer_mapping
  5119               Client            ClientRead
  3116               IO                DataFileRead
  558                Activity          WalWriterMain

Empirical evidence for read-only bottlenecks (3/3)
Experimental setup: test with perf
- Workload: pgbench read-only
- Hardware: Intel 8-socket machine (128 cores with HT)
- Scale factor: 300
- Run duration: 120 s

Analysis of read-only bottlenecks (1/3)
- The wait event test shows a significant bottleneck on the BufferMappingLock, especially when the data doesn't fit into shared buffers.
- There is also a significant bottleneck in GetSnapshotData: perf shows significant time spent taking the snapshot.

Analysis of read-only bottlenecks (2/3): BufferMappingLock
- Protects the shared buffer mapping hash table.
- Acquired in exclusive mode to insert an entry for a new block during buffer replacement, and in shared mode to find an existing buffer entry.
- The lock is partitioned for concurrency, as sketched below.
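
To make the partitioning concrete, here is a minimal sketch of a partitioned mapping lock in plain C. It is an illustrative model rather than the actual bufmgr code: BufTag, buftag_hash and the hashtable_lookup/hashtable_insert helpers are stand-ins for PostgreSQL's buffer tag, hash computation and BufTable routines, but the locking pattern (shared mode for lookup, exclusive mode for insert, one lock per partition) is the one described above.

    /* Simplified model of a partitioned buffer-mapping lock (not the real
     * PostgreSQL bufmgr code): NUM_PARTITIONS rwlocks each guard a slice of
     * one shared hash table, so lookups in different partitions never collide. */
    #include <pthread.h>
    #include <stdint.h>

    #define NUM_PARTITIONS 128          /* PostgreSQL partitions this lock 128 ways */

    typedef struct BufTag { uint32_t rel, block; } BufTag;

    static pthread_rwlock_t part_lock[NUM_PARTITIONS];

    /* Stand-ins for the shared hash table (BufTableLookup / BufTableInsert). */
    extern int  hashtable_lookup(const BufTag *tag, uint32_t hash);
    extern void hashtable_insert(const BufTag *tag, uint32_t hash, int buf_id);

    void init_partition_locks(void)
    {
        for (int i = 0; i < NUM_PARTITIONS; i++)
            pthread_rwlock_init(&part_lock[i], NULL);
    }

    /* Illustrative hash over the buffer tag; the real code computes its own. */
    static uint32_t buftag_hash(const BufTag *tag)
    {
        return (tag->rel * 2654435761u) ^ tag->block;
    }

    /* Shared-mode path: find an existing buffer for this tag. */
    int lookup_buffer(const BufTag *tag)
    {
        uint32_t hash = buftag_hash(tag);
        pthread_rwlock_t *lock = &part_lock[hash % NUM_PARTITIONS];

        pthread_rwlock_rdlock(lock);        /* shared: many readers at once */
        int buf_id = hashtable_lookup(tag, hash);
        pthread_rwlock_unlock(lock);
        return buf_id;
    }

    /* Exclusive-mode path: insert a mapping during buffer replacement. */
    void insert_buffer(const BufTag *tag, int buf_id)
    {
        uint32_t hash = buftag_hash(tag);
        pthread_rwlock_t *lock = &part_lock[hash % NUM_PARTITIONS];

        pthread_rwlock_wrlock(lock);        /* exclusive: blocks all readers */
        hashtable_insert(tag, hash, buf_id);
        pthread_rwlock_unlock(lock);
    }

Even with partitioning, every lookup still takes a shared lock on its partition, which is where the contention seen in the wait event test comes from.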

Analysis of read-only bottlenecks (3/3): GetSnapshotData
- Under read committed isolation, a snapshot has to be taken for every query.
- Computing the snapshot every time adds overhead.
- There is contention on ProcArrayLock, since the snapshot is computed while holding that lock.

Experiments for reducing the read-only bottlenecks
- C-Hash (a concurrent hash table) to remove buffer mapping lock contention.
- Snapshot caching to reduce the overhead of GetSnapshotData.

C-Hash
- A lock-free hash table built on memory barriers and atomic operations.
- Lookups take no locks at all, only memory barriers.
- Inserts and deletes use atomic operations.
- Delete just marks the node as deleted; the node does not become reusable immediately.
- When a backend wants to move entries from a garbage list to a free list, it must first wait for any backend scanning that garbage list to finish its scan.
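
The following is a minimal sketch of that idea in C11 atomics, not the actual C-Hash patch: a real implementation also needs the per-bucket garbage and free lists mentioned above, which are omitted here. It shows the three points from the slide: lock-free lookups using only barriers, CAS-based inserts, and deletes that merely mark a node.

    /* Illustrative lock-free hash table sketch (not the C-Hash patch itself). */
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define NBUCKETS 1024

    typedef struct CHashNode
    {
        uint32_t key;
        uint32_t value;
        atomic_bool deleted;                     /* set by delete; reclaimed much later */
        _Atomic(struct CHashNode *) next;
    } CHashNode;

    static _Atomic(CHashNode *) buckets[NBUCKETS];

    /* Lock-free lookup: no lock taken, only acquire barriers on the loads. */
    bool chash_lookup(uint32_t key, uint32_t *value)
    {
        CHashNode *node = atomic_load_explicit(&buckets[key % NBUCKETS],
                                               memory_order_acquire);
        while (node != NULL)
        {
            if (node->key == key &&
                !atomic_load_explicit(&node->deleted, memory_order_acquire))
            {
                *value = node->value;
                return true;
            }
            node = atomic_load_explicit(&node->next, memory_order_acquire);
        }
        return false;
    }

    /* Insert at the bucket head with compare-and-swap; retry if another backend
     * got there first.  The caller fills node->key and node->value beforehand. */
    void chash_insert(CHashNode *node)
    {
        _Atomic(CHashNode *) *head = &buckets[node->key % NBUCKETS];
        CHashNode *old = atomic_load_explicit(head, memory_order_acquire);

        atomic_store_explicit(&node->deleted, false, memory_order_relaxed);
        do {
            atomic_store_explicit(&node->next, old, memory_order_relaxed);
        } while (!atomic_compare_exchange_weak_explicit(head, &old, node,
                                                        memory_order_release,
                                                        memory_order_acquire));
    }

    /* Delete only marks the node; moving it from the garbage list to the free
     * list must wait for every backend scanning that garbage list to finish. */
    void chash_delete(CHashNode *node)
    {
        atomic_store_explicit(&node->deleted, true, memory_order_release);
    }

Because readers never take a lock, buffer lookups stop serializing on the mapping partitions; the price is the deferred reclamation protocol for deleted nodes.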

Experimental evaluation: results (1/2)
Experimental setup:
- pgbench read-only test
- Scale factor: 5000
- Shared buffers: 8 GB
- Intel 8-socket machine

Experimental evaluation: results (2/2)
Experimental setup: wait event test on the Intel 8-socket machine; pgbench read-only test; scale factor 5000; run duration 120 s.

On head:
  Event Count   Event Type   Event Name
  39919         LWLock       buffer_mapping
  5119          Client       ClientRead
  3116          IO           DataFileRead

With C-Hash:
  Event Count   Event Type   Event Name
  3102          Client       ClientRead
  633           IO           DataFileRead
  199           Activity     WalWriterMain

Experimental evaluation: conclusion
- With the C-Hash patch it scales up to 128 clients, with a maximum gain of >150% at higher client counts.
- The wait event test shows that contention on the buffer mapping lock is completely gone.
- There is an 8-10% regression at one client.

Cache the snapshot (1/2)
- Calculate the snapshot once and reuse it across backends for as long as it is valid.
- Once the snapshot is calculated, cache it in shared memory.
- The next time any backend needs a snapshot, it first checks the snapshot cache.

Cache the snapshot (2/2)
- If a valid cached snapshot is available, reuse it.
- Otherwise, calculate a new snapshot and store it in the cache.
- ProcArrayEndTransaction invalidates the cached snapshot.
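
A simplified model of the scheme is sketched below, assuming a single shared cache slot and using a pthread rwlock as a stand-in for ProcArrayLock; compute_snapshot_from_procarray is a hypothetical placeholder for the expensive ProcArray scan, and the real patch keeps the cache in shared memory and invalidates it from ProcArrayEndTransaction.

    /* Illustrative model of snapshot caching (not the actual patch). */
    #include <pthread.h>
    #include <stdbool.h>
    #include <stdint.h>

    typedef struct Snapshot { uint32_t xmin, xmax; /* ... running-xid array ... */ } Snapshot;

    static pthread_rwlock_t proc_array_lock = PTHREAD_RWLOCK_INITIALIZER;
    static Snapshot cached_snapshot;        /* lives in shared memory in the patch */
    static bool     cached_valid = false;

    /* Hypothetical stand-in for the expensive scan of all PGPROC entries. */
    extern void compute_snapshot_from_procarray(Snapshot *snap);

    void get_snapshot(Snapshot *snap)
    {
        /* Fast path: reuse the cached snapshot under the shared lock. */
        pthread_rwlock_rdlock(&proc_array_lock);
        if (cached_valid)
        {
            *snap = cached_snapshot;
            pthread_rwlock_unlock(&proc_array_lock);
            return;
        }
        pthread_rwlock_unlock(&proc_array_lock);

        /* Slow path: compute a fresh snapshot and publish it in the cache. */
        pthread_rwlock_wrlock(&proc_array_lock);
        if (!cached_valid)
        {
            compute_snapshot_from_procarray(&cached_snapshot);
            cached_valid = true;
        }
        *snap = cached_snapshot;
        pthread_rwlock_unlock(&proc_array_lock);
    }

    /* Transaction end (ProcArrayEndTransaction in the patch) invalidates the cache. */
    void end_transaction_invalidate(void)
    {
        pthread_rwlock_wrlock(&proc_array_lock);
        cached_valid = false;
        pthread_rwlock_unlock(&proc_array_lock);
    }

On a read-only workload the cache is invalidated rarely, so most backends hit the fast path and skip the full ProcArray scan entirely.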

Experimental evaluation: results
Experimental setup:
- pgbench read-only test
- Scale factor: 300
- Shared buffers: 8 GB
- Intel 8-socket machine

Experimental evaluation: conclusion
- With the patch, it scales up to 128 clients.
- Observed >40% gain at higher client counts.
- perf shows a significant reduction in time spent in GetSnapshotData.

Empirical evidence for read-write bottlenecks (1/2)
Experimental setup:
- Test on the Intel 8-socket machine
- Scale factor: 300
- Shared buffers: 8 GB
- sync_commit=on and sync_commit=off

Empirical evidence for read-write bottlenecks (2/2)
Wait event test at 128 clients.

sync_commit on:
  Count   Type     Name
  11607   IPC      ProcArrayGroupUpdate
  8477    LWLock   WALWriteLock
  7996    Client   ClientRead
  2671    Lock     transactionid
  557     LWLock   wal_insert

sync_commit off:
  Count   Type     Name
  10005   IPC      ProcArrayGroupUpdate
  9430    LWLock   CLogControlLock
  6352    Client   ClientRead
  5480    Lock     transactionid
  1368    LWLock   wal_insert

Analysis of read-write bottlenecks
- The wait event test shows ProcArrayGroupUpdate at the top, along with significant contention on WALWriteLock.
- With sync_commit off, WALWriteLock contention drops and the next contention point, CLogControlLock, shows up.

WAL Write Lock (1/3)
This lock is acquired to write and flush WAL buffer data to disk:
- during commit,
- while writing dirty buffers, if the corresponding WAL has not yet been flushed,
- periodically by the WAL writer.
We generally observe very high contention on this lock in read-write workloads.

WAL Write Lock (2/3): experiments to measure the overhead
- With WAL write and flush calls removed, TPS for the read-write pgbench test increases from 27871 to 45068 (scale factor 300, 64 clients).
- With fsync off, TPS for the read-write pgbench test increases from 27871 to 41835 (scale factor 300, 64 clients).

WAL Write Lock (3/3): split the WAL write and flush operations
- Take the WAL flush calls out of WALWriteLock and perform them under a new lock (WALFlushLock).
- This allows OS writes to proceed while an fsync is in progress.
- LWLockAcquireOrWait is used for the new WALFlushLock so that flush calls get accumulated.
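
The sketch below is an illustrative model of the split, not the actual patch: two pthread mutexes stand in for WALWriteLock and the proposed WALFlushLock, and wal_write_buffers/wal_fsync are hypothetical helpers.

    /* Splitting the WAL write from the WAL flush (illustrative model).
     * OS writes can proceed under wal_write_lock while another backend
     * fsyncs under wal_flush_lock. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdint.h>

    static pthread_mutex_t wal_write_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_mutex_t wal_flush_lock = PTHREAD_MUTEX_INITIALIZER;
    static _Atomic uint64_t written_upto;   /* WAL handed to the OS */
    static _Atomic uint64_t flushed_upto;   /* WAL known durable on disk */

    extern void wal_write_buffers(uint64_t upto);   /* hypothetical: write() of WAL buffers */
    extern void wal_fsync(void);                    /* hypothetical: fsync of the WAL file */

    void xlog_flush(uint64_t target)
    {
        /* Step 1: push WAL buffers to the OS, holding only the write lock. */
        pthread_mutex_lock(&wal_write_lock);
        if (written_upto < target)
        {
            wal_write_buffers(target);
            written_upto = target;
        }
        pthread_mutex_unlock(&wal_write_lock);

        /* Step 2: make it durable under the separate flush lock.  The patch
         * uses LWLockAcquireOrWait here: a backend that had to wait wakes up
         * without the lock and simply re-checks whether someone else already
         * flushed past its target, which is how flush calls accumulate.  A
         * plain mutex plus the re-check below approximates that behaviour. */
        pthread_mutex_lock(&wal_flush_lock);
        if (flushed_upto < target)
        {
            uint64_t flush_to = written_upto;   /* cover everything written so far */
            wal_fsync();
            flushed_upto = flush_to;
        }
        pthread_mutex_unlock(&wal_flush_lock);
    }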

Group flush the WAL (1/2)
- Each proc advertises its write location and adds itself to the pending_flush_list.
- The first backend to see the list empty becomes the leader and flushes the WAL on behalf of all procs on the pending_flush_list.
- The leader backend acquires the LWLock and traverses the pending_flush_list to find the highest write location to flush.

Group flush the WAL (2/2)
- The leader backend flushes the WAL up to the highest noted write location.
- After the flush, it wakes up all the procs whose WAL it has flushed.
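
A sketch of this protocol using C11 atomics follows; the name pending_flush_list comes from the slides, while the Proc fields and the lock/flush/wakeup helpers are simplified stand-ins (for instance, the real patch sleeps on a semaphore rather than spinning).

    /* Illustrative group WAL flush: backends push themselves onto a lock-free
     * pending list; whoever finds the list empty becomes the leader, flushes
     * up to the highest advertised write location, and wakes the rest. */
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef struct Proc
    {
        uint64_t write_location;            /* WAL position this backend needs flushed */
        _Atomic(struct Proc *) next;
        atomic_bool flush_done;             /* leader sets this when finished */
    } Proc;

    static _Atomic(Proc *) pending_flush_list;

    /* Hypothetical stand-ins for the real lock/flush/wakeup machinery. */
    extern void acquire_flush_lock(void);
    extern void release_flush_lock(void);
    extern void flush_wal_upto(uint64_t upto);

    void group_flush_wal(Proc *me, uint64_t my_write_location)
    {
        me->write_location = my_write_location;
        atomic_store(&me->flush_done, false);

        /* Advertise ourselves: push onto the pending list with CAS. */
        Proc *head = atomic_load(&pending_flush_list);
        do {
            atomic_store(&me->next, head);
        } while (!atomic_compare_exchange_weak(&pending_flush_list, &head, me));

        if (head != NULL)
        {
            /* Follower: a leader already exists, just wait for it to finish. */
            while (!atomic_load(&me->flush_done))
                ;                           /* the patch sleeps on a semaphore instead */
            return;
        }

        /* Leader: detach the whole list and find the highest write location. */
        acquire_flush_lock();
        Proc *group = atomic_exchange(&pending_flush_list, NULL);
        uint64_t flush_upto = 0;
        for (Proc *p = group; p != NULL; p = atomic_load(&p->next))
            if (p->write_location > flush_upto)
                flush_upto = p->write_location;

        flush_wal_upto(flush_upto);
        release_flush_lock();

        /* Wake everyone whose WAL we just flushed (including ourselves). */
        for (Proc *p = group; p != NULL; )
        {
            Proc *next = atomic_load(&p->next);
            atomic_store(&p->flush_done, true);
            p = next;
        }
    }

The point of the design is that one flush call covers the whole group, so the number of fsyncs no longer grows with the number of committing backends.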

Experimental evaluation: results
Experimental setup:
- Intel 2-socket machine (56 cores)
- pgbench read-write test
- Scale factor: 1000
- sync_commit off
- Shared buffers: 14 GB

Experimental evaluation: conclusion
- Combining the two approaches, group flushing and the separate flush lock, shows some improvement (~15-20%) at higher client counts.
- Neither approach shows a very big improvement on its own.
- More tests with larger WAL records could show the impact of combining writes.

Clog Control Lock (1/2)
The wait event test shows heavy contention on this lock at high client counts (>64), especially with a mixed workload. The lock is acquired:
- in exclusive mode to update a transaction's status in the CLOG,
- in exclusive mode to load a CLOG page into memory,
- in shared mode to read a transaction's status from the CLOG.

Clog Control Lock (2/2)
Contention on this lock arises because:
- multiple processes contend with each other when writing transaction status to the CLOG simultaneously,
- processes writing transaction status to the CLOG contend with processes reading transaction status from it.
Together these lead to high contention on CLogControlLock in read-write workloads.

Group update of transaction status in the CLOG
- The solution for CLogControlLock is similar to the ProcArray group XID clearing.
- One backend becomes the group leader and updates the transaction status for the rest of the group members.
- This reduces contention on the lock significantly.
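
Below is a sketch of the same leader/follower pattern applied to the CLOG status update; the request structure and the lock/status helpers are illustrative stand-ins, not the committed code.

    /* Illustrative group CLOG update: followers advertise their XID and status,
     * and one leader applies them all under a single acquisition of the CLOG
     * control lock. */
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef struct ClogRequest
    {
        uint32_t xid;                       /* caller fills xid and status */
        int      status;                    /* committed / aborted */
        _Atomic(struct ClogRequest *) next;
        atomic_bool done;
    } ClogRequest;

    static _Atomic(ClogRequest *) clog_group_head;

    /* Hypothetical stand-ins for the lock and the per-XID status write. */
    extern void acquire_clog_control_lock(void);
    extern void release_clog_control_lock(void);
    extern void clog_set_status(uint32_t xid, int status);

    void group_set_transaction_status(ClogRequest *req)
    {
        atomic_store(&req->done, false);

        /* Join the group with a CAS push. */
        ClogRequest *head = atomic_load(&clog_group_head);
        do {
            atomic_store(&req->next, head);
        } while (!atomic_compare_exchange_weak(&clog_group_head, &head, req));

        if (head != NULL)
        {
            while (!atomic_load(&req->done))    /* follower: the patch sleeps instead */
                ;
            return;
        }

        /* Leader: one lock acquisition covers every queued status update. */
        acquire_clog_control_lock();
        ClogRequest *group = atomic_exchange(&clog_group_head, NULL);
        for (ClogRequest *r = group; r != NULL; )
        {
            ClogRequest *next = atomic_load(&r->next);
            clog_set_status(r->xid, r->status);
            atomic_store(&r->done, true);
            r = next;
        }
        release_clog_control_lock();
    }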

Experimental evaluation: results (1/2)
Experimental setup:
- pgbench read-write test
- Scale factor: 300
- sync_commit off
- Shared buffers: 8 GB

Experimental evaluation: results (2/2)
Wait event test at 128 clients, sync_commit off.

On head:
  Event Count   Event Type   Event Name
  10005         IPC          ProcArrayGroupUpdate
  9430          LWLock       CLogControlLock
  6352          Client       ClientRead
  5480          Lock         transactionid
  1368          LWLock       wal_insert

With group Clog update:
  Event Count   Event Type   Event Name
  18671         IPC          ProcArrayGroupUpdate
  10014         Client       ClientRead
  6115          Lock         transactionid
  2552          LWLock       wal_insert
  687           LWLock       ProcArrayLock

Experimental evaluation: conclusion
- On head, performance falls off after 64 clients, whereas with the patch it scales up to 128 clients.
- ~50% performance gain at higher client counts.
- The wait event test shows a significant reduction in contention on CLogControlLock.

Conclusion
- The pgbench read-only workload shows a significant bottleneck on BufferMappingLock at higher scale factors; C-Hash shows a >150% gain at higher client counts.
- Snapshot caching shows a >40% gain when the data fits into shared buffers.
- The read-write test shows bottlenecks on ProcArrayGroupUpdate, WALWriteLock and CLogControlLock.
- Group flushing, together with taking the WAL flush calls out of WALWriteLock, shows a ~20% gain.
- The CLOG group update shows a ~50% gain.

Thank You