OpenAFS and Large Volumes. Mark Vitale AFS and Kerberos Best Practices Workshop 19 August 2015

Similar documents
Chapter 17: Distributed-File Systems. Operating System Concepts 8 th Edition,

Memory Management: The process by which memory is shared, allocated, and released. Not applicable to cache memory.

OpenAFS Unix Cache Manager Performance Mark Vitale AFS and Kerberos Best Practices Workshop 20 August 2015

CS510 Operating System Foundations. Jonathan Walpole

Background. 20: Distributed File Systems. DFS Structure. Naming and Transparency. Naming Structures. Naming Schemes Three Main Approaches

Operating System Concepts Ch. 11: File System Implementation

File System Internals. Jo, Heeseung

416 Distributed Systems. Distributed File Systems 2 Jan 20, 2016

Directory. File. Chunk. Disk

CHAPTER 11: IMPLEMENTING FILE SYSTEMS (COMPACT) By I-Chen Lin Textbook: Operating System Concepts 9th Ed.

CSE 451: Operating Systems. Section 10 Project 3 wrap-up, final exam review

File System Implementation. Sunu Wibirama

Computer Systems Laboratory Sungkyunkwan University

CS3600 SYSTEMS AND NETWORKS

File System Implementation. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

File System Internals. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Chapter 11: Implementing File Systems

File System Implementation

[537] Journaling. Tyler Harter

FILE SYSTEM IMPLEMENTATION. Sunu Wibirama

NTFS Recoverability. CS 537 Lecture 17 NTFS internals. NTFS On-Disk Structure

File System Implementation

Chapter 11: File System Implementation. Objectives

Ricardo Rocha. Department of Computer Science Faculty of Sciences University of Porto

Chapter 17: Distributed-File Systems. Operating System Concepts 8 th Edition,

CS 4284 Systems Capstone

[537] Fast File System. Tyler Harter

Advanced UNIX File Systems. Berkley Fast File System, Logging File System, Virtual File Systems

OPERATING SYSTEM. Chapter 12: File System Implementation

Distributed File Systems

The Berkeley File System. The Original File System. Background. Why is the bandwidth low?

Lecture 19: File System Implementation. Mythili Vutukuru IIT Bombay

Operating Systems. Lecture File system implementation. Master of Computer Science PUF - Hồ Chí Minh 2016/2017

CS5460: Operating Systems Lecture 20: File System Reliability

Hit The Ground Running: AFS

Principles of Operating Systems

ICS Principles of Operating Systems

Che-Wei Chang Department of Computer Science and Information Engineering, Chang Gung University

Crash Consistency: FSCK and Journaling. Dongkun Shin, SKKU

SMD149 - Operating Systems - File systems

Operating Systems. File Systems. Thomas Ropars.

Example Implementations of File Systems

File Systems. CS 4410 Operating Systems. [R. Agarwal, L. Alvisi, A. Bracy, M. George, E. Sirer, R. Van Renesse]

COMP 530: Operating Systems File Systems: Fundamentals

Chapter 12: File System Implementation

Bhaavyaa Kapoor Person #

Chapter 11: Implementing File Systems

File Systems. Kartik Gopalan. Chapter 4 From Tanenbaum s Modern Operating System

CS-537: Midterm Exam (Spring 2009) The Future of Processors, Operating Systems, and You

File Systems: Fundamentals

Implementation should be efficient. Provide an abstraction to the user. Abstraction should be useful. Ownership and permissions.

File Systems Ch 4. 1 CS 422 T W Bennet Mississippi College

Chapter 10: Case Studies. So what happens in a real operating system?

Chapter 12: File System Implementation

Chapter 11: Implementing File

File Systems: Fundamentals

What is a file system

Local File Stores. Job of a File Store. Physical Disk Layout CIS657

Chapter 11: Implementing File Systems. Operating System Concepts 9 9h Edition

Announcements. Persistence: Log-Structured FS (LFS)

Case study: ext2 FS 1

CS370: System Architecture & Software [Fall 2014] Dept. Of Computer Science, Colorado State University

Memory Management. Dr. Yingwu Zhu

Operating Systems Design Exam 2 Review: Fall 2010

Chapter 12: File System Implementation

CS 318 Principles of Operating Systems

File System: Interface and Implmentation

Chapter 7: File-System

Chapter 10: File System Implementation

CS720 - Operating Systems

Case study: ext2 FS 1

CS 318 Principles of Operating Systems

Announcements. Persistence: Crash Consistency

File Systems. Information Server 1. Content. Motivation. Motivation. File Systems. File Systems. Files

CS370 Operating Systems

CIS Operating Systems File Systems. Professor Qiang Zeng Fall 2017

CIS Operating Systems File Systems. Professor Qiang Zeng Spring 2018

File Systems. Chapter 11, 13 OSPP

File Systems: Naming

File System Implementation

Operating Systems. Operating Systems Professor Sina Meraji U of T

Pintos Project 4 File Systems. November 14, 2016

CS 550 Operating Systems Spring File System

A Heap of Trouble Exploiting the Linux Kernel SLOB Allocator

PROJECT 6: PINTOS FILE SYSTEM. CS124 Operating Systems Winter , Lecture 25

CSE380 - Operating Systems

mode uid gid atime ctime mtime size block count reference count direct blocks (12) single indirect double indirect triple indirect mode uid gid atime

Hacking AFS Dumps for Fun and Profit

Da-Wei Chang CSIE.NCKU. Professor Hao-Ren Ke, National Chiao Tung University Professor Hsung-Pin Chang, National Chung Hsing University

Distributed Systems. Hajussüsteemid MTAT Distributed File Systems. (slides: adopted from Meelis Roos DS12 course) 1/25

CS 537 Fall 2017 Review Session

Chapter 14: File-System Implementation

Operating Systems Design 16. Networking: Remote File Systems

Virtual File System. Don Porter CSE 306

3/26/2014. Contents. Concepts (1) Disk: Device that stores information (files) Many files x many users: OS management

Operating Systems, Fall

Chapter 11: Implementing File-Systems

CS370 Operating Systems

OPERATING SYSTEMS II DPL. ING. CIPRIAN PUNGILĂ, PHD.

Chapter 12: File System Implementation. Operating System Concepts 9 th Edition

Transcription:

OpenAFS and Large Volumes Mark Vitale <mvitale@sinenomine.net> AFS and Kerberos Best Practices Workshop 19 August 2015

objective / overview We will explore some of the current challenges and solutions for using OpenAFS with very large volumes. 2

How big is a piece of string? 3

case study A large university stores research project data on a single OpenAFS volume: millions of files and directories, almost 1 terabyte total content-addressable directory hierarchy e.g. one directory per person query_subject_nnnnn AFS files-per-directory limits forced awkward workarounds original volume okay, but unable to make a usable copy vos move/copy/dump/restore would run for many hours then fail triggered salvage which would run for many more hours, then either fail or produce damaged data workaround: tar/untar WAY out of band 4

outline / agenda big data considerations internals addressing the issues futures questions and answers 5

big data considerations architectural limits implementation limits server limits (disk, memory) operational limits (time, SLA) configuration changes performance issues (scalability) bugs increased exposure of infrequently exercised code paths 6

architectural limits OpenAFS must maintain interoperability with other AFS servers and clients as described in existing architecture, specification, and standards documents. Much of this is wire format for rx, RPCs, and objects. Two that are relevant to large volumes are the FID and the directory object. 7

the FID (File Identifier) uniquely identifies a file on the wire struct AFSFid { afs_uint32 Volume; afs_uint32 Vnode; afs_uint32 Unique; } architectural (wire) limit 2^32 (4Gi) Volumes per cell Vnodes per volume (dirs, files, symlinks) 8

the directory object architectural (wire) limits: no more than 64436 (not 65536) objects (files, dirs, symlinks) per directory each name no longer than 15 chars longer object names reduce the runtime limit for a given directory by consuming more (contiguous) blocks fragmentation may also reduce the runtime limit object name 256 bytes or less details in Mike Meffie s EAKC 2012 presentation: http://conferences.inf.ed.ac.uk/eakc2012/slides/eakc-2012-dirdefrag.pdf 9

the Volume Location database (VLDB) architectural (wire) limit in current RPCs of 2 GiB maximum size for an ubik database this is enough to hold approximately 14.5 million volumes per cell in the current VLDB format 4 far less than the 4 Gi limit the FID imposes 10

implementation limits 255 fileservers per cell 255 vice partitions per fileserver fileserver vice partition formats: old inode Windows namei Unix namei 11

vice partition format Unix namei /vicepxx partition Vnnnnnnnnn.vol files (VolumeDiskHeader) AFSIDat directory 1st level volume directories (hashed lower bits of RW volid) - volume directories (hashed RW volid) - 1st level inode directories (hashed high bits of vnode number) - 2nd level inode directories (hashed next 9 bits) - inode files (more hashing) - special directory - volume info 1 per volume in the VG (RW, RO, BK, clones) - large vnode index 1 per volume in the VG - small vnode index 1 per volume in the VG - link index 1 per VG deliberate incorporation of metadata redundancies 12

I'm not really sure what (the) intended purpose is. To me, this is one of those "historical reasons" where "historical reasons" means "it doesn't make any sense but it was like this when we got it. -- Andrew Deason 13

namei implementation limits NAMEI_VNODEMASK = 0x03FFFFFF 26-bit (not 32) vnode number 67,108,863 max vnodes per volume half small files, symlinks (including AFS mountpoints) half large directories (and associated ACLs) max 20 ACLs per directory max 7 links to a shared (cloned) vnode (RW, RO, BK, 4 clones) 14

Working As Implemented 15

bugs WAD Working As Designed WAI Working As Implemented unintentional implementation limits large volumes + infrequently exercised code paths = undiscovered bugs examples: salvage vos operations: move, copy, dump, restore 16

Can AFS make a volume so large that it cannot move it? 17

typeless arithmetic the heart of many of these issues is that the vnode number is used in unsafe calculations: bit offset into an internal volume memory map byte offset into the vnode index special files 18

vos move bugs are rarely exposed for very large volumes because no one wants to move them - it takes too long performance may mask bugs, as well as being a kind of bug in its own right. 19

salvager looks for errors and inconsistencies in AFS volume data and metadata, and makes a best effort to repair them. one of those programs (like restore) that is not used very often, but when you do you really hope it works! what if the salvager breaks, or takes too long? 20

Fix all the bugs you want we ll make more! (with apologies to Lay s potato chips) 21

addressing the issues files-per-directory limit workaround: implement site-specific nesting of user directories to skirt the AFS limits workaround: use shorter file and directory names when possible workaround: use defrag tool (ask Mike) standardization: extended directories 22

addressing the issues errors and corruption during volume operations (copy/move/dump/restore) root cause: incorrect (signed) type for a vnode number led to a 31/32 bit overflow when calculating an offset into a vnode index. patch submitted to fix the specific issue (gerrit 10447) reviewers called for code audit and a general fix new patches in test: 31/32 bit (signed/unsigned) type errors type-dirty macros (index offsets) convert to inline static functions error recovery bugs exposed along the way 23

addressing the issues salvager assertion fail while salvaging a large volume exposed by vos move bugs (previous slide) root cause: SalvageIndex had an incorrect (signed) type for a vnode number; this caused the vnode index offset logic to overflow at 2^31 and pass a negative file offset to IH_WRITE patch in gerrit 24

addressing the issues mkdir may crash dafileserver bug 1: allocating a new vnode overflows the index logic at 2^31 due to incorrect type; issues addled bitmap and goes to rarely-traversed error recovery code patch in gerrit bug 2: error recovery mishandles volume reference counting and crashes on a failed assert patch in gerrit 25

addressing the issues salvager resource requirements uses invisible (unlinked) temp work files e.g. salvage.inodes contains info on each inode default location is on the partition itself may grow quite large for large volumes, leading to unexpected disk space exhaustion during salvage recommend using tmpdir parm to put these elsewhere uses large memory-resident tables vnode essence tables that distill the vnode index metadata may result in a visit from the Linux OOM-killer 26

addressing the issues configuration changes just creating the test volumes exposed some problems in my cell config: cache manger: memcache/disk cache; separate partition; separate device; adequate stat caches (to prevent vcache invalidation and VLRU thrashing) fileserver: adequate callbacks (to prevent callback GSS) first accesses exposed more perf considerations cold CM caches cold fileserver caches ls command coloring causes extra FetchStatus calls CM prefetch activity 27

addressing the issues performance & scalability vos move/copy/dump/forward profiling salvager dir buffer optimizations salvage index scanning bottlenecks fileserver housekeeping optimizations 28

"Depend upon it, sir, when a man knows he is to be hanged in a fortnight, it concentrates his mind wonderfully. -- Samuel Johnson 29

current status under gerrit review: 10447 vol: VnodeId type consistency for vnode numbers 11983 DAFS: dafileserver crash after "addled bitmap 11984 DAFS: dafileserver failed assertion (vp->nusers >= 0) 11985 vol: DEBUG_BITMAP for vnode allocation debug, test 11986 vol: vnodeindexoffset not typesafe 30

futures Finish current audit/bug hunt/cleanup project Death to salvager? Why do we even have a salvager? to repair (after the fact) damage caused by: hardware software failures AFS design shortcomings (non-logging, etc?) AFS bugs DAFS improved some pain points in the salvager: greatly reduced salvage related outages salvage on demand, with no outage for other volumes salvage in parallel Extended directory objects? 31

Questions? 32