Lustre: A Platform for Intelligent Scale-Out Storage
Rumi Zahir, rumi.zahir@intel.com
May 2003
Agenda
- Problem Statement
- Trends & Current Data Center Storage Architectures
- The Lustre File System Project
- Intelligent Storage
- Discussion
Problem Statement
Goals:
- A scalable, shared, coherent, persistent file store built from commodity components that appears to its users as a single big system
- Scale I/O bandwidth/latency, data availability and capacity
- Reduce admin cost by supporting on-line data migration, load balancing and incremental hot add/remove of sub-components
- Proactive intelligent storage to accelerate data retrieval and indexing functions
Enterprise Trend: Storage Moving Out of the Box
Because:
- Fast processors need lots of disks to keep them busy, and lots of disks won't fit in the box
- More spindles -> lower latency, higher bandwidth
- Storage access protocols are already message-based (e.g. SCSI, NFS)
- Disk latency is such that increased distance does not hurt
- Specialized software on storage boxes can be optimized, and makes them easier to manage
- Storage needs to be shared
Current Storage Architectures
- Network-Attached Storage (NAS): file servers with specialized, easy-to-manage software
  [Diagram: Appl. Servers -> Ethernet (NFS, CIFS) -> NAS Storage]
- Storage Area Networks (SAN): pooled block storage, but no concurrent sharing
  [Diagram: Appl. Servers -> Fibre Channel (FCP) -> SAN Disks]
Storage Scalability Limiters
Network-Attached Storage (NAS):
- File-sharing protocols (e.g. NFS, CIFS):
  - Lack support for file striping across servers
  - Write-through caching (poor write performance)
  - Synchronous metadata updates serialize directory ops; e.g. updating the access time on every read could result in a file server op (NFS relaxes this)
  - Name space (server:disk:partition:file) encodes location
- NAS file servers are performance and management bottlenecks
Storage Area Networks (SAN):
- Block-based storage abstraction shares disk space, but allows no concurrent file sharing between clients
- Distributed databases use distributed lock managers to share access to block-level storage devices
The Lustre Project
Goal: Develop a scalable object-based file system with cluster-wide POSIX semantics for Linux
- Scalability target: 10,000 clients, 1,000 OSTs, 10 MDSs
Collaborative 3-year Linux open-source project:
- Cluster File Systems, Inc.
  - Peter Braam is Lustre Architect & Technical Project Lead
  - Strong Linux team with ext3 file system experience
- HP Network Storage Systems Operation
  - Project management, testing & productization
- Intel
  - Contributes instrumentation & performance analysis, storage targets
- National Labs (Livermore, Los Alamos, Sandia)
  - 3-year R&D funding, large clusters
Cluster File Systems (Gradual Evolution)
Symmetrical Block-Based:
- Block-oriented, with a distributed lock manager for coherence
- Peer-to-peer coherence protocols
- Focus: Fibre Channel SANs, typically a single OS (except Veritas)
- Examples: Sistina (GFS), IBM (GPFS), Veritas (CFS)
Asymmetrical Block-Based:
- Block-oriented, with out-of-data-path metadata servers
- Focus: SAN, multi-OS (Windows, Linux, ...), management
- Examples: PolyServe (Matrix Server), Veritas (SANPoint), IBM (StorageTank), EMC (HighRoad)
Asymmetrical Object-Based (Emerging):
- Block allocation & security functions migrate to the disk/storage controller
- Examples: CMU/NASD, Cluster File Systems (Lustre), IBM (StorageTank), Panasas
Scalable Shared Storage
[Diagram: Clients (= App. Servers) exchange IPC (small messages) with the Meta-Data Servers, which handle coherence management; bulk data transfers flow directly between clients and the Object Storage Targets, which handle storage management]
Lustre Scalability Enablers
Object Storage:
- Disk block allocation abstracted away from clients & metadata servers -> fewer items to keep coherent
File I/O Protocol:
- Choice for clients:
  - Cache file data with write-behind (good for small files)
  - Direct I/O without caching (good for large files)
- Vectored zero-copy bulk data transfers
  - Preposted receive buffers; protocol is RDMA/DDP-enabled
  - Supports TCP/IP, Quadrics, Myrinet
- Stripe a single file across multiple servers
  - Client-side logical object volumes enable concurrent I/O between a client and multiple servers on a single file (see the sketch below)
  - Different files can have different striping patterns
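To make the striping arithmetic concrete, here is a minimal C sketch of how a client-side logical object volume could map a logical file offset to an (OST index, object offset) pair, assuming a hypothetical fixed-stripe-size, RAID-0-style layout; the names and parameters are illustrative, not Lustre's actual layout code.

/* build: cc stripe.c */
#include <stdio.h>
#include <stdint.h>

struct stripe_loc {
    unsigned ost_index;     /* which object storage target */
    uint64_t object_offset; /* byte offset within that OST's object */
};

/* RAID-0-style mapping: stripes go round-robin across stripe_count OSTs. */
static struct stripe_loc map_offset(uint64_t file_offset,
                                    uint64_t stripe_size,
                                    unsigned stripe_count)
{
    uint64_t stripe_number = file_offset / stripe_size; /* global stripe index */
    struct stripe_loc loc;

    loc.ost_index = (unsigned)(stripe_number % stripe_count);
    /* Complete round-robin rounds already on this OST, plus the position
     * inside the current stripe. */
    loc.object_offset = (stripe_number / stripe_count) * stripe_size
                        + file_offset % stripe_size;
    return loc;
}

int main(void)
{
    uint64_t stripe_size = 1 << 20;  /* 1 MiB stripes (hypothetical) */
    unsigned stripe_count = 4;       /* file striped over 4 OSTs */
    uint64_t offsets[] = { 0, (1ull << 20) + 4096, 5ull << 20 };

    for (int i = 0; i < 3; i++) {
        struct stripe_loc loc = map_offset(offsets[i], stripe_size, stripe_count);
        printf("file offset %llu -> OST %u, object offset %llu\n",
               (unsigned long long)offsets[i], loc.ost_index,
               (unsigned long long)loc.object_offset);
    }
    return 0;
}

With 1 MiB stripes over 4 OSTs, consecutive megabytes of a file land on consecutive targets, which is what lets a single client drive concurrent I/O to multiple servers.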
Lustre Scalability Enablers (2)
Metadata Protocol:
- Metadata can be cached on clients
  - Allow caching when there is no contention
  - Write-behind caching with a recoverable journal (uses an InterMezzo-style server replay log)
  - Revert to a client/server model in case of heavy sharing
- Intent-based VFS lookups reduce the number of RPCs
Tightly Coupled Distributed Lock Manager:
- Modeled after the VAX cluster DLM
- Multiple lock namespaces:
  - Metadata locks: {P}R, {P}W, EX on file ids {inode/gen#}
  - Extent locks: byte-range locking on objects (a compatibility sketch follows below)
- Distribute extent locks over object storage targets
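As a rough illustration of extent-lock semantics, the sketch below tests whether a requested byte-range lock conflicts with a held one, assuming a simplified set of modes (PR, PW, EX); the real DLM adds more modes, lock queues, and server-side callbacks.

#include <stdbool.h>
#include <stdio.h>
#include <stdint.h>

enum lock_mode { LCK_PR, LCK_PW, LCK_EX };

struct extent_lock {
    enum lock_mode mode;
    uint64_t start, end; /* inclusive byte range on an object */
};

/* Two reads are compatible; anything involving a write/exclusive is not. */
static bool modes_compatible(enum lock_mode a, enum lock_mode b)
{
    return a == LCK_PR && b == LCK_PR;
}

static bool extents_overlap(const struct extent_lock *a,
                            const struct extent_lock *b)
{
    return a->start <= b->end && b->start <= a->end;
}

/* A new lock conflicts only if ranges overlap AND modes are incompatible. */
static bool locks_conflict(const struct extent_lock *held,
                           const struct extent_lock *req)
{
    return extents_overlap(held, req) &&
           !modes_compatible(held->mode, req->mode);
}

int main(void)
{
    struct extent_lock held      = { LCK_PR, 0, 4095 };
    struct extent_lock read_req  = { LCK_PR, 1024, 2047 };
    struct extent_lock write_req = { LCK_PW, 1024, 2047 };

    printf("read vs read:  conflict=%d\n", locks_conflict(&held, &read_req));
    printf("read vs write: conflict=%d\n", locks_conflict(&held, &write_req));
    return 0;
}

Because extent locks cover only the byte ranges actually touched, many clients can write disjoint regions of one striped file concurrently, each against the OST that owns the affected stripes.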
Intelligent Storage
Unlike disks or block-level storage arrays, Lustre Object Storage Targets (OSTs) have knowledge of logically contiguous file chunks.
Use OST intelligence to:
- Improve response times
  - Prefetch based on file content or file type
  - Optimize disk data layout & caching policies
- Add new proactive functionality
  - Snapshot/versioning through copy-on-write (see the sketch below); helps solve the backup/restore problem
  - Proactive indexing and/or pattern-matching engine
    - Move computation to the data instead of data to the computation
    - Opportunistically take advantage of unused storage device bandwidth/cycles to proactively build indices
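A toy C sketch of the copy-on-write idea behind snapshot/versioning follows, assuming a hypothetical per-object block table; a real OST would apply the same principle to on-disk allocation metadata rather than in-memory pointers.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NBLOCKS 4
#define BLKSIZE 8

struct object {
    char *blocks[NBLOCKS]; /* block pointers, shared until written */
};

/* Taking a snapshot copies only the block table, not the data. */
static struct object *take_snapshot(const struct object *live)
{
    struct object *snap = malloc(sizeof(*snap));
    memcpy(snap->blocks, live->blocks, sizeof(snap->blocks));
    return snap;
}

/* Write to the live object; copy the block first if a snapshot shares it. */
static void cow_write(struct object *live, const struct object *snap,
                      int blk, const char *data)
{
    if (live->blocks[blk] == snap->blocks[blk]) {
        char *copy = malloc(BLKSIZE);
        memcpy(copy, live->blocks[blk], BLKSIZE);
        live->blocks[blk] = copy;
    }
    snprintf(live->blocks[blk], BLKSIZE, "%s", data);
}

int main(void)
{
    struct object live;
    for (int i = 0; i < NBLOCKS; i++) {
        live.blocks[i] = malloc(BLKSIZE);
        snprintf(live.blocks[i], BLKSIZE, "old%d", i);
    }

    struct object *snap = take_snapshot(&live);
    cow_write(&live, snap, 1, "new1");

    /* The snapshot still sees the pre-write contents of block 1. */
    printf("live blk1: %s, snapshot blk1: %s\n",
           live.blocks[1], snap->blocks[1]);
    return 0;
}

Since only written blocks are copied, a snapshot costs almost nothing up front, which is what makes on-line backup against a stable snapshot practical.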
[Architecture diagram:
- User: POSIX / VFS interface over the Lustre Client
- Networking connects the client to the Lustre MDS, a Query Processing Service, and the Object Storage Targets (Lustre OSTs)
- Inside an OST: Object-Based Disk server (OBD server), Lock Server, Indexer, Query Engine
- Object-Based Disk (OBD) alternatives underneath: Ext2 OBD, or an OBD filter over a file system (XFS, JFS, Ext3, ...)]
- The OST pre-computes & stores content-based indices (an OBD interface sketch follows below)
- The Lock Server indicates change stability
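The following is a speculative C sketch of what a minimal OBD driver interface might look like: a vtable of object operations that every backend (Ext2 OBD, OBD filter, ...) implements, so layers can be stacked interchangeably. The struct and function names are invented for illustration and do not match Lustre's actual obd_ops.

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

typedef uint64_t obd_id;

/* One common interface, many backends: this is what enables layering. */
struct obd_ops {
    int (*create)(obd_id *out_id);
    int (*read)(obd_id id, uint64_t off, void *buf, size_t len);
    int (*write)(obd_id id, uint64_t off, const void *buf, size_t len);
    int (*destroy)(obd_id id);
};

/* A trivial stub backend standing in for Ext2 OBD / an OBD filter. */
static int mem_create(obd_id *out_id)
{
    static obd_id next = 1;
    *out_id = next++;
    return 0;
}
static int mem_read(obd_id id, uint64_t off, void *buf, size_t len)
{ (void)id; (void)off; (void)buf; (void)len; return 0; }
static int mem_write(obd_id id, uint64_t off, const void *buf, size_t len)
{ (void)id; (void)off; (void)buf; (void)len; return 0; }
static int mem_destroy(obd_id id) { (void)id; return 0; }

static const struct obd_ops mem_obd = {
    .create = mem_create, .read = mem_read,
    .write = mem_write, .destroy = mem_destroy,
};

int main(void)
{
    const struct obd_ops *ops = &mem_obd; /* backend chosen at setup time */
    obd_id id;
    char data[] = "hello";

    ops->create(&id);
    ops->write(id, 0, data, sizeof(data));
    printf("created object %llu and wrote %zu bytes\n",
           (unsigned long long)id, sizeof(data));
    ops->destroy(id);
    return 0;
}

An indexer or query engine written against such an interface never sees disk blocks, only whole objects, which is what gives the OST its file-chunk-level knowledge.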
Content-Based Indexing
Query Processing Service:
- Receives user queries
- Parallelizes queries across OSTs (see the scatter/gather sketch below)
- Aggregates query results
- Communicates with the MDS for queries with pathnames & striped objects
OST Indexer / Query Engine:
- User-defined indexing & query functions
- Opportunistic indexing & change tracking
- Support for simple query aggregation
- Support for Live Queries (notification on match)
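A minimal sketch of the scatter/gather pattern the Query Processing Service could use, with local threads standing in for RPCs to each OST's query engine; the query string and match counts are made up for illustration.

/* build: cc query.c -lpthread */
#include <stdio.h>
#include <pthread.h>

#define NUM_OSTS 4

struct ost_query {
    int ost_index;
    const char *query;  /* opaque user-defined query */
    long matches;       /* result: number of matching objects */
};

/* Stand-in for an RPC to one OST's query engine. */
static void *run_query_on_ost(void *arg)
{
    struct ost_query *q = arg;
    /* Pretend each OST found (index + 1) matches for the query. */
    q->matches = q->ost_index + 1;
    return NULL;
}

int main(void)
{
    pthread_t threads[NUM_OSTS];
    struct ost_query queries[NUM_OSTS];
    long total = 0;

    /* Scatter: one query per OST, executed in parallel. */
    for (int i = 0; i < NUM_OSTS; i++) {
        queries[i].ost_index = i;
        queries[i].query = "type == image && width > 1024";
        pthread_create(&threads[i], NULL, run_query_on_ost, &queries[i]);
    }

    /* Gather: wait for all OSTs and aggregate their partial results. */
    for (int i = 0; i < NUM_OSTS; i++) {
        pthread_join(threads[i], NULL);
        total += queries[i].matches;
    }

    printf("aggregated matches across %d OSTs: %ld\n", NUM_OSTS, total);
    return 0;
}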
[Diagram: inside the OST, index functions feed the Indexer, which maintains a Live Query list; aggregated queries and live queries are served by the Query Engine through the OST interface]
- The OST interface maintains a list of changed, unindexed objects (a small sketch follows below)
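A small sketch of how an indexer might drain the changed-object list and check live queries, with an invented match predicate standing in for user-defined query functions.

#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t obd_id;

/* A live query: a predicate plus a notification callback. */
struct live_query {
    bool (*matches)(obd_id id);
    void (*notify)(obd_id id);
};

static bool is_even_object(obd_id id) { return id % 2 == 0; }
static void report_match(obd_id id)
{
    printf("live query matched object %llu\n", (unsigned long long)id);
}

int main(void)
{
    /* Objects whose contents changed but have not been re-indexed yet. */
    obd_id changed[] = { 7, 8, 11, 12 };
    int nchanged = 4;

    struct live_query lq = { is_even_object, report_match };

    /* Opportunistic pass, e.g. run when the device is otherwise idle:
     * re-index each changed object and check it against live queries. */
    for (int i = 0; i < nchanged; i++) {
        /* ... run user-defined index functions on changed[i] here ... */
        if (lq.matches(changed[i]))
            lq.notify(changed[i]);
    }
    return 0;
}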
Proactive Indexing & Query Summary
- Query interface orthogonal to the file system
- Proactive, opportunistic indexing integrated into the OST
- User-definable indexing / query functions
- Distributed query processing
- Push & pull model for queries
Intelligent Storage Research Projects @ Intel
- Self-Tuning: optimized data placement & caching policies (with CMU PDL)
- Content-Based Indexing:
  - Image matching (Intel CMU Lablet)
  - Integration with Lustre (Intel Santa Clara)
- Using Lustre as an Object Storage Research Platform
For more information:
- Intel R&D: http://www.intel.com/labs/storage
- Lustre: http://www.lustre.org
- Work @ CMU: http://www.ece.cmu.edu/~mmesnier