IBM Research

Flash-Conscious Cache Population for Enterprise Database Workloads

Hyojun Kim, Ioannis Koltsidas, Nikolas Ioannou, Sangeetha Seshadri, Paul Muench, Clem Dickey, Larry Chiu
IBM Research: Almaden, Zurich

ADMS 2014, 1st September 2014, Hangzhou, China

© 2013 IBM Corporation
Outline
- Flash-based caching and why it is different
- Weaknesses of existing approaches
- The Scalable Cache Engine
- Experimental evaluation using TPC-E
- Conclusions
Host-based Flash Caching
- Flash fills the gap between DRAM and HDDs very well
  - A good fit as a caching layer between DRAM and HDDs
- Goal: bring the data as close as possible to the application
- Benefits:
  - Low latency
  - Higher throughput (typically)
  - Transparent to file systems and applications
  - Lower administration overhead compared to SAN Flash
  - Eliminates SAN congestion, increases storage consolidation
- Typically employs a write-through mode

[Diagram: application and database servers with a direct-attached SSD; a block-level cache driver sits below the filesystem and in front of the SAN-attached storage]
Flash-based vs. DRAM-based Caching
Inherent differences necessitate a different way of thinking:
A. A different place in the hierarchy: flash caches typically see different workloads than DRAM caches
B. A different memory technology: flash caches use NAND Flash, while DRAM caches use DRAM
Difference A: Their place in the storage hierarchy
- Flash caches encounter different workloads
  - The DRAM cache receives the hottest portion of the data
  - The flash cache sees a wider, sparser range of data with lower access frequency
  - → Flash caches see weaker locality
- Flash caches need to have relatively large capacities
  - → Flash caches are prone to long warm-up times
- Flash caches utilize premium memory space for metadata
  - High-performance flash caches store metadata in main memory (DRAM)
  - → Flash cache metadata footprints may limit their scalability

[Diagram: DRAM → Flash → Storage hierarchy]
Difference B: Their underlying memory technology
- DRAM: symmetric reads/writes; uniform, extremely low latency; does not wear out
- Flash: asymmetric reads/writes; higher latency, heavily dependent on access patterns; limited endurance

With Flash:
- Cache population is expensive
  - With DRAM we can cache all cache-missed data with just a memory copy
- Unnecessary flash writes reduce performance and shorten the device lifetime
  - Less flash bandwidth remains available for reads, all the more so for small, random writes
- Unconditional population can result in thrashing
- Populating only the cache-missed data results in unacceptably long warm-up times
  - More than 10 hours for a 300GB flash cache*

* Byan et al., "Mercury: Host-side flash caching for the data center", MSST 2012
Experimental setup for TPC-E
- TPC-E workload (5,000 customers, 3 days, 4 threads)
- Experiments with three existing open-source flash cache solutions
  - All three perform synchronous population of all cache misses in the user data path
  - Write-through mode
- Compared against the baseline (no-cache) and an ideal configuration

Database host:
- Hardware: IBM System x3650 M3, 24 cores, 64 GiB RAM
- OS: Fedora 19, kernel 3.11.7-200.fc19.x86_64; hypervisor: KVM
- Guest VM: RHEL 6.4, kernel 2.6.32-358.18.1.el6.x86_64; 8 virtual CPUs, 15 GiB RAM
- DB2 Express-C v10.5
- 16 GiB volume with flash cache (flash cache management layer)
- PCI-e SSD: 200 GB, enterprise-class device
- 16 GiB partition on an iSCSI volume, over Gigabit Ethernet

iSCSI target host:
- Hardware: IBM System x3650 M2, 16 cores, 24 GiB RAM
- OS: RHEL 6.4, kernel 2.6.32-358.23.2.el6.x86_64
- scsi-target-utils-1.0.24-3.el6_4.x86_64
- RAID: 3x 73 GB HDDs (15K RPM, 6 Gb SAS)

Measurements: transactions/sec, average transaction latency, cache statistics, raw storage access statistics (/proc/diskstats)
Transactions per second (TpsE): TPC-E evaluation results

[Chart: TpsE over time (hours) for the three open-source solutions; annotations mark 5x and 1.5x throughput differences]

Table 1: TPC-E average TpsE (last 1-hour average) and memory usage with three open-source flash cache solutions; the best solution shows the highest average TpsE and the biggest memory usage.

  Configuration       TpsE   Read hit rate   Memory usage
  no-cache            11.2   N/A             N/A
  ideal               87.9   N/A             N/A
  solution 1 (best)   56.9   98.66%          1,212 MiB
  solution 2          17.3   96.75%          15 MiB
  solution 3          17.5   96.52%          415 MiB

Observations:
- The best solution showed ~3.3x better performance than the other two
- ...but only achieved 67% of the ideal performance
- The warm-up time was too long
- The hit rate is not the problem here
Transactions per second (TpsE): TPC-E evaluation results (continued)

The problem is that cache population occurs:
- Unconditionally
- In the foreground
- At too fine a granularity
The Scalable Cache Engine (SCE)
A cache engine that makes population as flash-friendly as possible:
- Selective caching
- Coarse-grained cache management
- Asynchronous cache population
- Asynchronous write-through
Selective Caching

What?
- Do not populate on every read cache miss
- Only populate the data that are deemed worthy of caching

Why?
- Cache pollution is avoided → increased hit rate
- Lower rate of writes to the SSD → higher SSD read bandwidth, lower SSD read latency
- Fewer total writes to the SSD → longer device lifetime

How?
- The cache continuously monitors the user access patterns
- A recency-and-frequency filter is applied to population candidates
- Only the data that are classified as hot are promoted into the cache
- Various algorithms implemented; the presented experiments use a variation of MRU

[Flow: cache miss → selective caching filter → hot? yes: populate; no: keep monitoring]
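The presented experiments used a variation of MRU. As a rough illustration of the recency-and-frequency idea only (not the authors' algorithm), the sketch below promotes a fragment into the cache only after it has missed several times while still inside a bounded recency window; all class names and parameters are hypothetical.

```python
from collections import OrderedDict

class SelectiveCachingFilter:
    """Sketch of a recency-and-frequency population filter.

    Read misses are counted per fragment inside an LRU-ordered window;
    a fragment is promoted only after `threshold` misses. One-off
    accesses (e.g., a scan) age out of the window without ever being
    written to flash. Parameters are illustrative, not from the paper.
    """

    def __init__(self, window_size=1024, threshold=3):
        self.window = OrderedDict()        # fragment_id -> miss count, LRU order
        self.window_size = window_size
        self.threshold = threshold

    def record_miss(self, fragment_id):
        """Return True if the fragment should be populated into flash."""
        count = self.window.pop(fragment_id, 0) + 1
        if count >= self.threshold:
            return True                    # classified hot: promote, stop monitoring
        self.window[fragment_id] = count   # re-insert as most recently seen
        if len(self.window) > self.window_size:
            self.window.popitem(last=False)  # forget the least recently seen
        return False
```

A real engine would run this filter in the miss path and hand promoted fragments to the background population workers.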
Coarse-grained Cache Management
- A fragment is 1MB of contiguous logical space
- Cache management occurs at fragment granularity
  - A fragment is the unit of population, eviction, and workload monitoring
- User operations occur at 4kB page granularity
  - Reads, invalidates, write-through

Benefits:
- Small metadata footprint → scalability: only 76 bytes per 1MB fragment
- Exploits spatial locality in workloads; works effectively as a prefetching mechanism
- Fast cache warm-up

[Chart: metadata footprint for a 2TB cache: 12GB and 4GB for fine-grained designs vs. 152 MB at fragment granularity]
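The footprint arithmetic is easy to check. The sketch below uses the 76 bytes per 1MB fragment from the slide; the 24-bytes-per-4KiB-block figure used for the fine-grained comparison is an assumption for illustration, chosen because it reproduces the 12GB bar.

```python
FRAGMENT_SIZE = 1 << 20          # 1 MiB fragment (unit of cache management)
METADATA_PER_FRAGMENT = 76       # bytes of metadata per fragment, from the slide

def fragment_metadata_bytes(cache_capacity_bytes):
    """DRAM needed for fragment-granularity metadata."""
    return (cache_capacity_bytes // FRAGMENT_SIZE) * METADATA_PER_FRAGMENT

def per_block_metadata_bytes(cache_capacity_bytes, block=4 << 10, meta=24):
    """Hypothetical fine-grained design: `meta` bytes per 4 KiB block."""
    return (cache_capacity_bytes // block) * meta

two_tib = 2 << 40
print(fragment_metadata_bytes(two_tib) / (1 << 20))   # 152.0 (MiB)
print(per_block_metadata_bytes(two_tib) / (1 << 30))  # 12.0 (GiB)
```

At fragment granularity a 2TB cache needs roughly 152 MB of metadata, versus on the order of 12GB for a 4KiB-block design, which is why the fine-grained caches in Table 1 pay for their hit rates with gigabytes of DRAM.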
Asynchronous Background Population
(implemented in the Linux Device Mapper framework)

- Population is a background task
  - The population unit is a fragment (1MB)
- Outside the user data path
  - Does not add latency to user reads
- The cache has full control over how much and when to populate
- The cache limits the population rate by limiting the number of threads
  - as the hit rate grows
  - as the SSD latency increases

[Diagram: the Scalable Cache Engine dispatches population tasks to asynchronous worker threads, which write whole fragments to the caching device (SSD); cache hits are served from the SSD, misses from the backing device (HDD); an AWT worker thread handles invalidations and write-through]

Controlled population rate, flash-friendly writes, minimal impact on foreground operations.
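A minimal sketch of the throttled background population described above. The worker counts are taken from the thresholds mentioned on the evaluation slides (two population threads at a 95% read hit rate, one at 100%); the class and function names are illustrative and not from the actual Device Mapper implementation.

```python
import queue
import threading

class PopulationEngine:
    """Sketch of asynchronous, throttled cache population.

    Misses enqueue fragment-population tasks; background workers drain
    the queue outside the user data path, so the user read returns
    without waiting for the SSD write. The number of active workers
    shrinks as the read hit rate rises, bounding the flash write load.
    A real engine would also deduplicate queued fragments.
    """

    def __init__(self, populate_fn, max_workers=8):
        self.tasks = queue.Queue()
        self.populate_fn = populate_fn   # writes one 1MB fragment to the SSD
        self.max_workers = max_workers

    def workers_for(self, hit_rate):
        """Throttle: fewer population threads as the hit rate grows."""
        if hit_rate >= 1.0:
            return 1
        if hit_rate >= 0.95:
            return 2
        return self.max_workers

    def on_miss(self, fragment_id):
        self.tasks.put(fragment_id)      # user read completes immediately

    def drain(self, hit_rate):
        """Run the currently allowed number of workers to completion."""
        def worker():
            while True:
                try:
                    frag = self.tasks.get_nowait()
                except queue.Empty:
                    return
                self.populate_fn(frag)

        threads = [threading.Thread(target=worker)
                   for _ in range(self.workers_for(hit_rate))]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
```

In the real engine the workers run continuously; `drain` collapses that into a single step so the throttling logic is visible.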
Asynchronous Write-Through

[Diagram] On a 4 KiB user write:
1. The 4 KiB write arrives
2. The page's bit in the page-validity bitmap is invalidated
3. An AWT task is enqueued (memory copy of the data)
4. The data is written through to the backing device (HDD)
5. The write I/O completes

A background AWT worker thread then:
6. Dequeues the AWT task
7. Updates the flash (caching device, SSD)
8. Validates the bitmap

- Follows the same principles as population: occurs outside the user data path
- The cache can control how much and when to write through
- The user write latency becomes independent of the SSD write latency
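The eight-step flow above can be sketched as follows; the dict-based "devices" and all names are purely illustrative stand-ins for the real block devices.

```python
from collections import deque

class AsyncWriteThrough:
    """Sketch of the asynchronous write-through (AWT) path.

    A user write invalidates the page's validity bit, queues a copy of
    the data for the flash cache, writes through to the backing device,
    and completes. A background worker later applies the queued write
    to the SSD and re-validates the page, so user write latency never
    depends on SSD write latency. Reads of an invalid page fall back to
    the backing device, which always holds the latest data.
    """

    def __init__(self):
        self.valid = set()        # page-validity bitmap (set of valid pages)
        self.awt_queue = deque()  # queued (page, data) AWT tasks
        self.ssd = {}             # caching device
        self.hdd = {}             # backing device

    def user_write(self, page, data):
        self.valid.discard(page)              # 2. invalidate bitmap
        self.awt_queue.append((page, data))   # 3. enqueue AWT task (memory copy)
        self.hdd[page] = data                 # 4. write through to backing device
        # 5. the user write completes here, independent of SSD latency

    def awt_worker_step(self):
        page, data = self.awt_queue.popleft() # 6. dequeue AWT task
        self.ssd[page] = data                 # 7. update flash
        self.valid.add(page)                  # 8. validate bitmap

    def read(self, page):
        return self.ssd[page] if page in self.valid else self.hdd[page]
```

Between steps 5 and 8 the cached copy is stale but marked invalid, so correctness is preserved while the SSD write is deferred.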
TPC-E with 20GB Flash Cache

[Charts: (a) transactions per second (TpsE) over time for SCE, flashcache, and no-cache, with annotated 7.5x, 5x, and 1.5x throughput differences; SCE warms up far faster ("read hit rate becomes 95%: only two threads do population"; "read hit rate becomes 100%: only one thread remains active for cache population"); (b) average read hit rate (%): 100% for SCE vs. 98.7%, 96.7%, and 96.5% for the alternatives; (c) average response time (ms) over the last hour for each configuration]
TPC-E with 3GB Flash Cache
- The data (16GB) do not fit in the cache
- Continuous population and eviction are exercised
- Thrashing amplifies the problem

[Chart: TpsE over time for SCE, flashcache, and no-cache; SCE sustains a 3.6x to 4.9x throughput advantage]
TPC-E with 3GB Flash Cache: I/O Statistics

[Charts: I/O traffic (MiB/s) over time, broken down into SSD reads, SSD writes, iSCSI reads, and iSCSI writes]
Volume of Flash Writes

SSD write volume per transaction (KiB):
- 8-hour TPC-E with 20GiB flash cache: 142.5, 145.2, and 106.4 for the open-source solutions vs. 60.2 for SCE → 43% to 59% fewer writes
- 4-hour TPC-E with 3GiB flash cache: 379.9, 330.0, and 209.4 vs. 164.8 for SCE → 21% to 57% fewer writes

Increasingly important as we move towards c-MLC and TLC Flash!
Conclusions
Traditional cache population schemes are bad for flash-based caches.

Main conclusions:
1. Population should be selective, i.e., do not promote all cache-missed data
2. Coarse-grained cache management is beneficial: small metadata footprint, short warm-up times, benefits from prefetching
3. Population should be a background operation, i.e., outside the user data path; similarly for write-through operations
4. Population should occur in chunks on the order of 1MB
IBM Easy Tier Server for DS8000
Distributed host-based flash caching for DS8000 client hosts
- Implements the algorithms and techniques described
- For IBM Power hosts running the IBM AIX OS
- The storage server manages cluster cache coherence
- Integration between the host-side flash caches and the automated tiering on the storage server (Easy Tier)
- More information:
  - "Software-defined just-in-time caching in an enterprise storage system", IBM Journal of Research and Development, vol. 58, no. 2/3, pp. 7:1-7:13, March-May 2014
  - "IBM System Storage DS8000 Easy Tier Server", http://www.redbooks.ibm.com/redpapers/pdfs/redp5013.pdf
Backup
Transactions per second (TpsE) with varying fragment size

[Backup charts: (a) TpsE over time for the baseline (1MB fragments, with AWT) vs. 8MB fragments, 128KB fragments, a configuration without AWT, and no-cache; (b) average read hit rate (%), e.g., 91.1% for the baseline vs. 89.8% with 8MB fragments and 73.6% with 128KB fragments; (c) average response time (ms) over the last hour for each configuration]
Memory Footprint

[Backup charts: metadata memory usage (MiB) of the evaluated solutions]
- Flash cache size 20 GiB: 1.1G, 15M, 415M, and 34M
- Flash cache size 1 TiB: 6.0G, 34M, 2.1G, and 93M