SRM-Buffer: An OS Buffer Management Technique to Prevent Last Level Cache from Thrashing in Multicores

SRM-Buffer: An OS Buffer Management Technique to Prevent Last Level Cache from Thrashing in Multicores. Xiaoning Ding, The Ohio State University, dingxn@cse.ohio-state.edu; Kaibo Wang, The Ohio State University, wangka@cse.ohio-state.edu; Xiaodong Zhang, The Ohio State University, zhang@cse.ohio-state.edu. ACM EuroSys '11, April 10-13, 2011, Salzburg, Austria. 2012/1/4

Outline: Introduction; Related Work; Proposed Selected Region Mapping (SRM) Buffer; Prototype Implementation; Conclusion

Introduction (1/3): The CPU cache and the operating system buffer cache are two critical layers for narrowing the speed gap between the processor and the disk. We need good cooperation between the multicore architecture and the increasingly large capacity of main memory; otherwise, severe performance degradation may be incurred.

Introduction (2/3): Buffer data usually have much weaker temporal locality than VM data. VM-intensive (computation-intensive) workloads: scientific applications. File-intensive workloads: grep, tar.

Introduction (3/3): A thread accessing a large set of data cached in the OS buffer may significantly slow down its co-runners, because it can easily pollute the shared hardware cache(s) in the processor. The last level cache (e.g., the L3 on a Core i7, behind per-core L1 and L2 caches) is shared among the cores of a multicore processor. Cache pollution: a process loads data into the CPU cache that has few opportunities to be reused. [Slide diagram: cores with private L1/L2 caches sharing an L3 last level cache; VM pages and buffer pages in main memory, backed by disk.]

Related Work (1/4): Cache-memory address mapping. Application (virtual address): virtual page number + page offset. OS (physical address): physical page number + page offset. Hardware (cache address): cache tag + cache index + block offset.

Related Work (2/4): Physical address in the page coloring technique: 1. Cache color: the bits shared between the cache set index and the physical page number. 2. Cache sets with the same color value form a cache region.

Related Work (3/4): For example, there are 64 different cache colors on an Intel Xeon 5355 processor with a page size of 4KB (2^12 bytes). 1. The colors evenly divide the last level cache into 64 non-overlapping regions. 2. They also divide physical pages into 64 disjoint groups. If the physical memory size is 4GB (2^32 bytes), each color has 16384 pages (2^14), i.e., 64MB of physical memory with 4KB pages. [Slide diagram: the cache divided into 64 colored regions, and memory divided into 64 groups of 16384 pages each.]
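
As a concrete illustration (a minimal sketch, not code from the paper), the cache color of a physical page is given by the low-order bits of the page frame number that overlap with the cache set index; the constants below assume 64 colors and 4KB pages, as in the Xeon 5355 example above.

#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12      /* 4 KB pages                          */
#define NUM_COLORS 64      /* e.g., the Xeon 5355 example above   */

/* Color = low-order bits of the physical page frame number that
 * overlap with the cache set index. */
static unsigned page_color(uint64_t phys_addr)
{
    uint64_t pfn = phys_addr >> PAGE_SHIFT;   /* physical page frame number     */
    return (unsigned)(pfn % NUM_COLORS);      /* 6 overlapping bits -> 64 colors */
}

int main(void)
{
    /* Pages that are 64 page frames apart fall into the same color,
     * i.e., they map to the same last level cache region. */
    printf("%u %u\n",
           page_color(0x12345000ULL),
           page_color(0x12345000ULL + 64ULL * 4096ULL));
    return 0;
}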

Related Work (4/4): Cache pollution can be restricted by allocating physical pages of the same color. In a conventional OS buffer cache: 1. File blocks are allocated physical pages of random colors. 2. Accessing the OS buffer cache therefore pollutes the LLC very quickly; the whole cache is polluted within a short period of time.

Selected Region Mapping (SRM) Buffer (1/7): To control cache pollution, minimize the number of colors of the physical pages assigned to OS buffer pages. In an SRM-buffer, an application accesses a batch of buffer pages of the same color. Related blocks are usually accessed in sequence, so the SRM-buffer identifies groups of related blocks. An appropriate sequence length is 256 pages.

Selected Region Mapping Buffer (2/7): Identify sequences (streams of file blocks). Blocks in the same sequence are mapped to the same cache region (i.e., they receive the same color when they are loaded into the OS buffer). Colors are changed dynamically for different sequences of blocks.

Selected Region Mapping Buffer (3/7): How are sequences determined (i.e., which blocks are accessed together)? 1. Same-file heuristic: blocks in the same file. 2. Same-application heuristic: blocks consecutively loaded by the same process. A sketch of these heuristics follows.
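
A rough sketch (under assumed names, not the authors' code) of how the two heuristics could decide whether a newly loaded block extends the current sequence. The block_req fields (inode, pid) and the helper are hypothetical; the 256-page cap comes from the sequence length suggested above.

#include <stdbool.h>
#include <stddef.h>

#define SEQ_MAX_LEN 256   /* suggested sequence length (in pages) */

/* Hypothetical descriptor of a block being loaded into the buffer cache. */
struct block_req {
    unsigned long inode;  /* file the block belongs to */
    int           pid;    /* process issuing the read  */
};

/* State of the sequence currently being colored. */
struct sequence {
    unsigned long inode;
    int           pid;
    size_t        length; /* pages already placed in this sequence */
};

/* Return true if the block should join the current sequence (and thus
 * receive a page of the same color), false if a new sequence starts. */
bool same_sequence(const struct sequence *seq, const struct block_req *req)
{
    if (seq->length >= SEQ_MAX_LEN)
        return false;                 /* sequence long enough: switch color */
    if (req->inode == seq->inode)
        return true;                  /* same-file heuristic */
    if (req->pid == seq->pid)
        return true;                  /* same-application heuristic */
    return false;
}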

Selected Region Mapping Buffer (4/7): How do we coordinate the different requirements that buffer management and virtual memory management place on physical page allocation, while retaining a high hit ratio in the OS page cache? Two risks: 1. hurting the hit ratio of the page cache; 2. an uneven color distribution among the physical pages available to virtual memory.

Selected Region Mapping Buffer (5/7): Structure: 1. Normal zone: managed by conventional OS buffer replacement and contains the LRU lists (e.g., the active and inactive lists in Linux). 2. Colored zone: free pages and a small number of inactive pages. Pages in the colored zone are organized into multiple lists, and each list links the pages of the same color. If a page in the colored zone is hit, it is moved to the normal zone. On page faults or OS buffer misses, SRM-buffer reclaims pages from the colored zone.
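
A minimal data-structure sketch of the two zones, assuming one singly linked list per cache color in the colored zone; the names and layout are illustrative, not the kernel's or the paper's.

#include <stddef.h>

#define NUM_COLORS 64

/* Minimal page descriptor for this sketch. */
struct page {
    struct page *next;
    unsigned     color;   /* cache color of this physical page */
};

/* Colored zone: free pages and some inactive pages, one list per color. */
struct colored_zone {
    struct page *lists[NUM_COLORS];   /* each list links pages of one color */
    size_t       count[NUM_COLORS];   /* length of each color list          */
};

/* Normal zone: conventional LRU lists (e.g., Linux active/inactive lists),
 * kept abstract here. A colored-zone page that is hit is promoted here. */
struct normal_zone {
    struct page *active;
    struct page *inactive;
};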

Selected Region Mapping Buffer (6/7): On a buffer miss, physical pages of a single color are allocated to the file blocks loaded in a sequence; the color is changed dynamically after the number of pages allocated in that color reaches a threshold. On VM page faults, pages are allocated uniformly from the different color lists to hold VM pages. Page hits are handled automatically by the normal zone. A sketch of this allocation policy follows.
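
A sketch of the allocation policy described above, with assumed thresholds and helper names (not the paper's implementation): buffer misses keep drawing pages from one color list until the sequence limit is reached or the list runs dry, then a new, sufficiently long list is chosen; VM page faults draw pages round-robin across colors so the color distribution stays even.

#include <stddef.h>

#define NUM_COLORS      64
#define SEQ_MAX_LEN     256  /* pages per sequence (from the slides)          */
#define COLOR_THRESHOLD 64   /* assumed: minimum list length to pick a color  */

struct page { struct page *next; };

struct colored_zone {                 /* as in the previous sketch */
    struct page *lists[NUM_COLORS];
    size_t       count[NUM_COLORS];
};

static struct page *pop(struct colored_zone *cz, unsigned color)
{
    struct page *p = cz->lists[color];
    if (p) {
        cz->lists[color] = p->next;
        cz->count[color]--;
    }
    return p;
}

/* Buffer miss: keep allocating from one color list for the current
 * sequence; switch colors when the sequence or the list runs out. */
struct page *alloc_buffer_page(struct colored_zone *cz,
                               unsigned *cur_color, size_t *seq_pages)
{
    if (*seq_pages >= SEQ_MAX_LEN || cz->count[*cur_color] == 0) {
        for (unsigned c = 0; c < NUM_COLORS; c++) {
            if (cz->count[c] > COLOR_THRESHOLD) {   /* pick a long-enough list */
                *cur_color = c;
                *seq_pages = 0;
                break;
            }
        }
        /* If no list is long enough, the colored zone would be refilled here. */
    }
    (*seq_pages)++;
    return pop(cz, *cur_color);
}

/* VM page fault: spread allocations evenly across the color lists. */
struct page *alloc_vm_page(struct colored_zone *cz, unsigned *rr_color)
{
    *rr_color = (*rr_color + 1) % NUM_COLORS;
    return pop(cz, *rr_color);
}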

Selected Region Mapping Buffer (7/7): Buffer miss (a sequence): page refill. Another sequence: when the threshold is reached, the color is changed dynamically. Page allocation on a VM page fault: 1. For the first block in a sequence, SRM-buffer selects a list with more pages than a threshold and reclaims a page from that list. 2. When such a list cannot be found, or a list becomes empty, SRM-buffer refills the colored zone.

Performance Evaluation (1/6): Experiment setup: 1. A Dell PowerEdge 1900 workstation with two 2.66GHz quad-core Xeon X5355 processors and 16GB of RAM. 2. A Dell Precision T1500 workstation with an Intel Core i7 860 processor and 8GB of RAM. 3. The operating system is 64-bit Red Hat Enterprise Linux AS release 5. 4. The file system is ext3. 5. The Linux kernel version is 2.6.30. We used pfmon [HP Corp. 2010] to collect performance statistics such as last level cache misses.

Performance Evaluation (2/6): We test SRM-buffer with a PostgreSQL database server [PostgreSQL 2008] supporting data warehouse workloads. The size of the fact table is about 4GiB. The queries use hash joins and sequential scans. Slowdown is measured relative to a solo run: slowdown = (T2 - T1) / T1, where T1 is the solo-run execution time and T2 is the co-run execution time.
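
For instance (illustrative numbers, not results from the paper): if a query takes T1 = 100 seconds when running alone and T2 = 130 seconds when co-running, the slowdown is (130 - 100) / 100 = 30%.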

Performance Evaluation (3/6): TPC-H benchmarks on PostgreSQL. 1. First group: Q6, Q15 (sequential scans). 2. Second group: Q5, Q7, Q8, Q10, Q11, Q18 (mixed features: join, scan, sort).

Performance Evaluation (5/6): With different workloads: file-intensive and VM-intensive.

Performance Evaluation (4/6)

Performance Evaluation (6/6): Two applications access the same set of data with different access patterns.

Conclusion: On a multicore system, a thread can slow down its co-running threads by flushing their to-be-reused data out of the shared last level cache. SRM-buffer addresses this by enhancing the page allocation policies of the OS buffer so that cache pollution is limited to the corresponding cache regions. SRM-buffer detects block sequences and allocates physical pages accordingly. Our evaluation with a prototype shows that it improves application performance and decreases execution times.