hashfs: Applying Hashing to Optimize File Systems for Small File Reads
Paul Lensing, Dirk Meister, André Brinkmann
Paderborn Center for Parallel Computing, University of Paderborn
Contents (slide 2): Motivation and Problem · Design Idea · Implementation · Evaluation · Conclusion
Section: Motivation and Problem (slide 3)
Scenario (slide 4)
- Web traffic, e.g. profile images, web site archives
- Internet Archive: > 2,500 nodes, > 6,000 hard disks
- Similar: Facebook profile images: > 6.5 billion images, most images 5-20 KB
- Challenges: access to small files costs multiple IOs per read request; this high overhead limits reads/s per disk
Motivation (slide 5)
- Small files (4 KB to 20 KB)
- Accesses are almost exclusively reads
- Filenames are randomly generated or calculated
- High number of concurrent users
- No name or directory locality; accesses are randomly distributed
- RAM/disk ratio is low
- Some POSIX features are not important: directory permissions, last file access time (atime)
Recursive directory lookup (slide 6)
Example: read file /var/image/20
1. Read dentries of /
2. Read dentries of /var
3. Read dentries of /var/image
4. Read file /var/image/20
Partial solution (slide 7)
- Store small file data in the inode
- Eliminates one IO operation, but that operation does not seek anyway
- Directory lookups remain the major problem
Goal: one IO operation per read request (in the web scenario) (slide 8)
Section: Design Idea (slide 9)
Basic Idea (slide 10)
- Compute the location of a file, no (recursive) lookup
- Hash the file path to a position on disk: hash(/var/images/2010/05/03/10)
Hashing to tracks (slide 11)
- Hashing to an exact position is not a good approach: collisions
- Instead, hash the pathname to a track and read the complete track
- Overhead, but not prohibitive in a random workload
- Needs geometry data of the disk; alternatively, each data region can be used as a track
Section: Implementation (slide 12)
hashfs (slide 13)
- Prototype file system based on ext2
- Transparent to applications and 3rd-party tools (no extra library etc.), except for directory permission checks
- Metadata: a track inode per hashed file, stored in a track inode block per track
Track Metadata (slide 14)
- Every file has an additional track inode holding only vital information: inode number, size, security data, identity hash, collision bit, n direct pointers
- Track inode block: stores all track inodes of a track, in the first file system block of the track
- Default: 113 track inodes per track inode block
Write new file (slide 15)
1. Hash pathname to track
2. Hash pathname to identity hash
3. Write normal inode
4. Read track inode block
5. Search for a track inode via the identity hash
6. If a track inode already exists (collision): set its collision bit; else: create a track inode
Read file (slide 16)
1. Hash pathname to track
2. Hash pathname to identity hash
3. Read track inode block
4. Search for the track inode via the identity hash
5. If it does not exist, or its collision bit is set: fail and use the normal lookup; else: success
Block Allocation (slide 17)
- Allocate a block for the hashed file
- Possible issue: no free blocks on the hashed track
- Basic approach: keep a ratio of each track free for hashed files, depending on file system utilization
- Use the hashing approach only for small files
Other Issues (slide 18)
- Hashed track completely used by file system metadata: prevented by eliminating the track from the geometry data
- Duplicate identity hashes
- No free track inodes in the track inode block
- When the alternative lookup fails, the normal lookup is necessary as backup
[Figure: allocation issue ratio (in %), 0 to 0.06, over file system utilization 70% to 95%] (slide 19)
Simulation using the file set distribution of the central AFS file system at UPB
Integration in Linux (slide 20)
- Modification of the VFS layer: bypass the recursive directory lookup with a pathname method (additional function pointer)
- Transparent to other file systems: if the pathname lookup has no result, the recursive directory lookup is used
- hashfs as an additional file system module; geometry data via module options
Section: Evaluation (slide 21)
Evaluation (slide 22)
- Random file access distribution: uniform distribution and Zipf-like distribution
- Web traffic is assumed to be Zipf-like; CDNs, memcache, or other caching layers flatten the skewness
Setting (slide 23)
- File size: 4 KB; file count: 18 million
- Track size: 465 sectors per track (heuristic, since tracks have different sizes on disk)
- Partition: 90 GB, kept small to limit evaluation time; cache size adjusted accordingly
- RAM size: 256 MB, 512 MB, 1024 MB
File sets (slide 24)
- Flat file set: 10,000 files per directory, max directory depth 2
- Deep file set: 100 files per directory, max directory depth 6
- The deep file set is better for ext2, except with large caches
- Here: only results for the deep file set
[Figure: uniform distribution; reads per second (0 to 90) for hashfs vs. ext2 at RAM sizes 256 MB, 512 MB, 1024 MB] (slide 25)
[Figure: IO operations per file for HashFS vs. ext2; IO reads per file (0 to 10) over file counts 6M to 20M, at 256 MB, 512 MB, 1024 MB RAM] (slide 26)
[Figure: Zipf distribution, zipf(0.50) and zipf(0.95); reads per second (0 to 160) for hashfs vs. ext2 at RAM sizes 256 MB, 512 MB, 1024 MB] (slide 27)
Evaluation Summary (slide 28)
- hashfs: nearly constant access time, only influenced by allocation problems due to utilization
- ext2: very different performance values, depending on cache size and cache state, number and depth of directories, and number of files
Section: Conclusion (slide 29)
Limits and State (slide 30)
- No full POSIX conformity: directory access permissions are not checked, noatime is implicit
- Negative performance impact for large file reads and for writes (updates to the track metadata)
- Current state: prototype implementation in Linux 2.6.18; writes are implemented poorly; evaluation in more complex settings
Conclusion (slide 31)
- Hashing pathnames is a viable approach, though limited to certain situations
- The approach is not limited to ext2: all general-purpose file systems use the recursive directory lookup approach, costing IO operations for the lookup on a cache miss
Questions? (slide 32)
Background: ext2 (slide 33)
- Disk layout: boot sector, super block, block group 1 ... block group n
- Each block group: block descriptors, block bitmap, inode bitmap, inode table, data blocks
Background: inodes (slide 34)
- Inode fields: mode, owner, size, time stamps, ...
- 12 direct pointers (data blocks 1-12), 1 indirect pointer (blocks 13-1036), 1 double indirect pointer, 1 triple indirect pointer
[Figure: influence of cache state (ext2), warm-up phase] (slide 35)
Track inode issues (slide 36)

Number of files | Disk utilization | Track inode issues (avg.) | Ratio (avg.)
1-13 million    | < 69.6%          | 0                         | 0%
14 million      | 74.7%            | 1.9                       | < 0.0001%
15 million      | 79.8%            | 25.7                      | 0.0002%
16 million      | 84.9%            | 277.9                     | 0.0017%
17 million      | 90.0%            | 1,946.6                   | 0.0115%
18 million      | 95.1%            | 10,170.0                  | 0.0565%

Simulation using the file set distribution of the central AFS file system at UPB