Replication, History, and Grafting in the Ori File System
Ali José Mashtizadeh, Andrea Bittau, Yifeng Frank Huang, David Mazières
Stanford University
Managed Storage: $5-10/GB+ $1/GB/Year
Local Storage: $0.04/GB
What's missing? Data management
- Availability: Data is always live.
- Accessibility: Data is globally accessible.
- Durability: Data is never lost. (History, Snapshots, Backup)
- Usability: Collaboration and version control are easy.
Ori File System
Goal: All the benefits of Managed Storage, implemented with hardware you already own.
Local Storage: $0.04/GB
Two Main Usage Models
- Personal storage
- Shared storage
[Figure label: Public Folders]
Managed storage limitations today
- Bandwidth
  - Limited by WAN bandwidth
- Privacy
- Storage cost
  - $ per GB of managed solutions
- Poor integration of replication, versioning & sharing
  - Copying files across machines
  - Apple Time Machine, Windows 8 File History, Applications implement their own versioning
  - Emailing documents, Distributed version control
Idea: Leverage trends to do better
- Big disks
- Fast LANs
- Mobile storage
Disk vs WAN Throughput Growth
[Figure: Growth (log scale) of Internet Speed and Disk Space, 1990 to 2013. Transfer time: 14 hours (1990), 278 days (2013).]
468x Transfer Time Gap!
Ori design principles
- Store not just files but file history
  - Take advantage of disk space
- Replicate files and history widely
  - Make replication easy and instantaneous
  - No master replica (OK if any device fails)
  - Uses LAN speed and disk space
- Use history for sharing
Ori Provides
- History
- Public Folders
- Replication
- File Sharing with History (Grafting)
- Recovery
History
SFSRO/Git-like Data Model
- Content Addressable Storage
- SHA-256 Hash: Globally unique namespace
- Deduplication
[Figure: object graph with nodes labeled ... Older Commit, Commit, Tree, Large Blob, Blob, Blob (fragment), Blob (shared)]
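A minimal sketch of the content-addressed model on this slide, assuming a toy in-memory store. The `ObjectStore` class, the JSON encoding of trees and commits, and the helper names are hypothetical illustrations, not Ori's actual on-disk format.

```python
import hashlib
import json


class ObjectStore:
    """Toy content-addressed store: every object is keyed by its SHA-256 hash."""

    def __init__(self):
        self.objects = {}  # hex digest -> raw bytes

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        # Identical content always hashes to the same key, so storing it
        # a second time changes nothing: this is where deduplication comes from.
        self.objects.setdefault(digest, data)
        return digest

    def get(self, digest: str) -> bytes:
        return self.objects[digest]


def put_tree(store, entries):
    """Store a directory as a JSON map of name -> child hash (assumed encoding)."""
    return store.put(json.dumps(entries, sort_keys=True).encode())


def put_commit(store, tree, parent, message):
    """Store a commit naming its root tree and its parent commit (or None)."""
    body = {"tree": tree, "parent": parent, "message": message}
    return store.put(json.dumps(body, sort_keys=True).encode())


store = ObjectStore()
readme = store.put(b"hello ori\n")
copy = store.put(b"hello ori\n")  # same bytes, same hash, stored once
tree1 = put_tree(store, {"README": readme, "README.bak": copy})
c1 = put_commit(store, tree1, None, "first snapshot")
notes = store.put(b"todo\n")
tree2 = put_tree(store, {"README": readme, "notes.txt": notes})
c2 = put_commit(store, tree2, c1, "second snapshot")
```

Because a commit hash covers the tree hash and the parent hash, a single commit ID names an entire snapshot and its history, and unchanged files are shared between snapshots without extra storage.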
Apply DVCS Techniques
- Merge diverging replicas
- Detect conflicts
  - No magic bullets for all file types
  - Make merge base available
  - 3-way merge line-oriented files
- Provide convenient tools
  - History, snapshots, branches, ...
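The role of the merge base can be sketched at whole-file granularity. This is an illustrative simplification, not Ori's merge code: `three_way_merge` and `merge_trees` are hypothetical names, and a line-oriented merge would apply the same rule to individual hunks.

```python
def three_way_merge(base, ours, theirs):
    """Whole-file three-way merge.

    Each argument is the file content in one version, or None if the file
    does not exist there. Returns (merged_content, is_conflict).
    """
    if ours == theirs:
        return ours, False    # both sides agree, or made the same change
    if ours == base:
        return theirs, False  # only the other replica changed the file
    if theirs == base:
        return ours, False    # only this replica changed the file
    return None, True         # both changed it differently: conflict


def merge_trees(base, ours, theirs):
    """Merge three snapshots given as dicts of file name -> content."""
    merged, conflicts = {}, []
    for name in sorted(set(base) | set(ours) | set(theirs)):
        result, conflict = three_way_merge(
            base.get(name), ours.get(name), theirs.get(name)
        )
        if conflict:
            conflicts.append(name)
        elif result is not None:  # None means the file was deleted
            merged[name] = result
    return merged, conflicts


base = {"a.txt": "1", "b.txt": "1", "d.txt": "old"}
ours = {"a.txt": "2", "b.txt": "1"}                    # edited a.txt, deleted d.txt
theirs = {"a.txt": "1", "b.txt": "3", "d.txt": "old", "c.txt": "new"}
merged, conflicts = merge_trees(base, ours, theirs)
```

Without the base version, the first three cases are indistinguishable from a conflict, which is why keeping history makes automatic merging possible.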
Storage Layout
- Objects are deduplicated, compressed, and stored
- Log structured storage (files on your local file system)
- Index used to look up object locations
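The three bullets above can be sketched together as an append-only log plus an in-memory index. This is a toy illustration under stated assumptions: `PackStore` is a hypothetical name, `io.BytesIO` stands in for a packfile on the local file system, and zlib is an assumed compressor rather than the one Ori uses.

```python
import hashlib
import io
import zlib


class PackStore:
    """Append-only object log with a hash -> location index."""

    def __init__(self):
        self.log = io.BytesIO()  # stands in for a packfile on disk
        self.index = {}          # hex digest -> (offset, compressed length)

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        if digest in self.index:
            return digest        # already stored: deduplicated
        payload = zlib.compress(data)
        offset = self.log.seek(0, io.SEEK_END)  # always append at the end
        self.log.write(payload)
        self.index[digest] = (offset, len(payload))
        return digest

    def get(self, digest: str) -> bytes:
        offset, length = self.index[digest]
        self.log.seek(offset)
        return zlib.decompress(self.log.read(length))


pack = PackStore()
first = pack.put(b"x" * 1000)
second = pack.put(b"x" * 1000)  # duplicate content adds nothing to the log
other = pack.put(b"different content")
```

Writes only ever append, so existing objects are never rewritten, and a lookup costs one index probe plus one read.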
Replication: Simplify data management
Today
- Backup
- Centralized File Storage
- Dropbox
- SCP/Rsync/AirDrop
Egalitarian Replication
Replication subsumes backup
- Crash!
- Recover with Replication
- Background Fetch optimization makes replica creation feel instantaneous
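One way to picture why a replica can feel instantaneous is to separate creating it from filling it with data. The sketch below is an assumption about the general idea, not Ori's implementation: `LazyReplica` and its methods are hypothetical, and a dict stands in for the source replica.

```python
class LazyReplica:
    """Replica that is usable immediately and fills in object data later."""

    def __init__(self, source):
        self.source = source       # dict: object ID -> bytes (the other replica)
        self.wanted = set(source)  # IDs learned from metadata at creation time
        self.local = {}            # objects actually present on this device

    def read(self, oid):
        # On-demand path: fetch an object the first time something reads it.
        if oid not in self.local:
            self.local[oid] = self.source[oid]
        return self.local[oid]

    def background_fetch(self, limit=None):
        # Background path: pull objects that nobody has asked for yet.
        missing = sorted(self.wanted - set(self.local))[:limit]
        for oid in missing:
            self.local[oid] = self.source[oid]
        return len(missing)


source = {"a": b"1", "b": b"2", "c": b"3"}
replica = LazyReplica(source)   # returns at once, no object data copied yet
data = replica.read("b")        # fetched on demand
fetched = replica.background_fetch()  # the rest arrives in the background
```

Reads that arrive before the background pass finishes pay one fetch each; once it completes, the replica is a full copy and can serve as the recovery source after a crash.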
Replication in Ori
- Opportunistic replication (Use LAN)
  - Bulk transport over SSH
- Automatic device discovery and synchronization
  - UDP multicast messages at a 5-second interval
  - Set a cluster name and symmetric key
  - Protected by AES-CBC
Replicate Deltas
- Delta consists of a collection of objects
- Versioning makes replication easy!
[Figure: object graph with nodes labeled ... Older Commit, Commit, Tree, Large Blob, Blob, Blob (fragment), Blob (shared); the objects in the Delta are marked Δ]
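A delta can be sketched as a set difference over the object graph: everything reachable from the new commit minus everything the receiver already has. This is an illustrative sketch, not Ori's wire protocol; `reachable`, `delta`, and the adjacency-list representation are assumptions.

```python
def reachable(objects, root):
    """Return the set of object IDs reachable from root.

    `objects` maps an object ID to the IDs it references: a commit
    references its parent and its tree, a tree its children, a blob nothing.
    """
    seen, stack = set(), [root]
    while stack:
        oid = stack.pop()
        if oid in seen:
            continue
        seen.add(oid)
        stack.extend(objects[oid])
    return seen


def delta(objects, new_commit, old_commit=None):
    """Objects a replica at old_commit needs in order to reach new_commit."""
    have = reachable(objects, old_commit) if old_commit is not None else set()
    return reachable(objects, new_commit) - have


objects = {
    "c1": ["t1"], "t1": ["b1", "b2"], "b1": [], "b2": [],
    "c2": ["c1", "t2"], "t2": ["b1", "b3"], "b3": [],
}
to_send = delta(objects, "c2", "c1")
```

Here only the new commit, the new tree, and the one new blob are sent; the unchanged blob `b1` is shared by both snapshots and never retransmitted.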
Protocol
- Content Addressable Storage: Objects are identical on disk and wire
  - No rewriting of objects
- Reference Counting: Decompress metadata to update reference counts
  - Decompression is faster than compression
Distributed Fetch
- Depends on content addressable storage
- Trade off Storage for Bandwidth
[Figure labels: WAN (Mbps), Fast LAN (Gbps), Unrelated File System]
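The dependence on content addressing can be sketched as follows: because the hash names the content, an object obtained from a nearby, unrelated file system can be verified before it is accepted. This is a hedged illustration; `fetch`, the dict-based peers, and the `"lan"`/`"wan"` labels are hypothetical.

```python
import hashlib


def fetch(digest, nearby_peers, remote):
    """Fetch an object by hash, preferring nearby peers over the remote source.

    Each peer and the remote are modeled as dicts of hex digest -> bytes.
    Returns (data, where_it_came_from).
    """
    for peer in nearby_peers:
        data = peer.get(digest)
        # Accept a peer's copy only if it really hashes to the requested ID.
        if data is not None and hashlib.sha256(data).hexdigest() == digest:
            return data, "lan"
    return remote[digest], "wan"  # fall back to the slow link


blob_a, blob_b = b"shared library source", b"new release notes"
id_a = hashlib.sha256(blob_a).hexdigest()
id_b = hashlib.sha256(blob_b).hexdigest()
remote = {id_a: blob_a, id_b: blob_b}
peers = [{id_a: b"corrupted copy"}, {id_a: blob_a}]  # second peer has a good copy

result_a = fetch(id_a, peers, remote)
result_b = fetch(id_b, peers, remote)
```

The corrupted copy is skipped, the good local copy is used, and only the object no nearby peer holds crosses the WAN.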
Grafting: File Sharing with History
Collaboration Today
- Cloud
- Over Email
- Version Control
File Sharing with Versioning
- We want the file system to manage versioning and sharing
- Require no forethought in setting up version control
- No more insane naming: Presentation_Alice_Final_Bob_2_Final.pptx
Grafting in Ori
[Figure: Commit History. Alice: A1, A2, A3, B3*. Bob: B1, B2, A1*, A2*, A3*, B3. Labels: Alice's Latest Snapshot, Cross repository links.]
Conflicts in Ori
- Detects conflicts using history
- Automatic merging when possible
- Otherwise, provide files for 3-way merge: file, file:conflict, file:base
- Conflicts rarely occur in single user model
- Conflicts more likely with Grafts; merges are explicit
Mobile Devices: Sneakernets!
Today: Device space underutilized
- iCloud, Google Drive, Office 365/SkyDrive
Data Carriers: Phone Storage Space
[Figure: Capacity (GB), 0 to 140, from Oct-06 to Dec-14]
Fast wireless networks
[Figure: Per-stream Bandwidth (Mbps), 1 to 10000, for 802.11, 802.11b, 802.11g, 802.11n, 802.11ac, and 802.11ad, from Oct-95 to Dec-14. Annotation: 4-8 Streams (MIMO)]
Sneakernets
Sneakernets
- Average Commute in US: 25 Minutes
- Carry 16 GB Storage
- 5.2 Gbps Effective Bandwidth
Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway. - Andrew S. Tanenbaum
Performance
Performance
- File system benchmarks: Filebench
- Network file system: Source code build
* Everything measured on an SSD, except the network benchmark
File system in User Space (FUSE)
- Ori is built using FUSE
- Baseline against the FUSE loopback
- Compare: ext4, ori, loopback
[Figure: software stack. User Space: Benchmark, FUSE Driver (orifs, loopback). Kernel: FUSE Kernel Module, Ext4. Hardware: SSD.]
Architecture
[Figure: component diagram. orifs (FUSE Driver): FS Metadata In Memory (directories, fstat), Staging Area (File Data Only). libori: Blob, Tree, Commit, HttpStorage, LocalStorage, SSHStorage, Connection Manager. Stored on ext4: Object Storage (Packfiles), Index, Metadata, Staging Area (Data Cache).]
Filebench: Synthetic Workloads
[Figure: Operations/s (Normalized), higher is better, for ext4, ori, and loopback on the fileserver, webserver, varmail, webproxy, and networkfs workloads]
Ori vs NFS: Remote compile
Time (s), lower is better. BF = On-demand Background Fetch.
- LAN (1 Gbps): NFSv3 20.45, NFSv4 19.45, Ori 11.33, Ori w/bf 16.04
- WAN (2/20 Mbps, 17 ms): NFSv3 54.85, NFSv4 44.07, Ori 15.3, Ori w/bf 19.34
- Chart annotations: 40% longer, 23% longer
Related Work
- Network File Systems: AFP, CIFS, LBFS, NFS, Shark, ...
- Distributed File Systems: AFS, ...
- Disconnected File Systems: Coda, Ficus, JetFile, Intermezzo, ...
- Archival File Systems: Elephant, Plan 9, WAFL, Wayback, ZFS, ...
- Version Control: Git, Mercurial, ...
- Application Solutions: Bayou, Dropbox, ...
Lessons Learned
- Hardware and use cases have evolved
- File systems need to catch up!
- Replication is no longer just for data-centers
- Keeping file history should be the default
- Mobile devices create an opportunity for better solutions
  - Fast LAN, Large Storage, Sneakernets
Future Work
- Application Support for Merging on Ori
- API Complications
- Merges can surprise applications and users
- Event notification?
- Integrating Grafting and Orisync Authentication
Questions?
- Visit: http://ori.scs.stanford.edu/
- Available for OS X, Linux, and FreeBSD
- See paper for details on additional features
Backup Slides
Mobile Device Battery Life
- Use 802.11 (or USB)
  - Better for battery life
- Some platforms have:
  - Periodic callbacks (opportunistically optimize battery life)
  - Geofencing callbacks (wake up when arriving at a location)
Bonnie: IO Benchmark
[Figure: Operations Per Second, 0 to 300000, higher is better, for ext4, ori, and loopback on 16K read, 16K write, and 16K rewrite]
Distributed Fetch - Performance
- Remote pull of Python 3.2.3 source
- Peer either has Python 2.7.3 or 3.2.3
- Time (s): Distributed Pull 7.75, Partially Distributed Pull 132.05, Remote Pull 170.79
[Figure: Source, Nearby Peer, and Destination. Internet link: 110ms, 290/530KB up/down]
Ori vs NFS
            NFSv3               NFSv4               Ori                 Ori on-demand
            LAN       WAN       LAN       WAN       LAN       WAN       LAN       WAN
Replicate   -         -         -         -         0.49 s    2.93 s    -         -
Configure   8.14 s    21.52 s   7.25 s    15.54 s   0.66 s    0.66 s    1.01 s    1.33 s
Build       12.32 s   33.33 s   12.20 s   28.54 s   9.50 s    9.55 s    11.45 s   12.77 s
Snapshot    -         -         -         -         0.19 s    0.19 s    2.72 s    3.37 s
Push        -         -         -         -         0.49 s    1.58 s    0.85 s    1.89 s
Total       20.45 s   54.85 s   19.45 s   44.07 s   11.33 s   15.30 s   16.04 s   19.34 s