Managing Lustre™ Data Striping

Metadata Server Extended Attributes and Lustre Striping APIs in a Lustre File System
Sun Microsystems, Inc.
February 4, 2008

Table of Contents

Overview
Data Striping in a Lustre File System
    Striping Extended Attributes
    Quality Attribute Scenarios
Striping Format
    Striping Disk Format
        Normal Striping EA Formats
            LOV_MDS_MD
            LOV_OST_DATA
        Joined Striping EA Format
            Joined File Stripe Format
            LOV_MDS_JOINED_MD
            MDS_EXTENT_DESCRIPTION
            JOINED File LOG Formats
    Striping Memory Format
    Striping User Format
Striping APIs
    Get/Set Striping EA APIs
        fsfilt_set/get_md
    Pack/Unpack Striping EA APIs
        obd_packmd
        obd_unpackmd
    Allocation/Free
        obd_size_diskmd
        obd_alloc_diskmd
        obd_free_diskmd
        obd_alloc_memmd
        obd_free_memmd
    Striping Location APIs
        lov_stripe_size
        lov_stripe_offset
        lov_stripe_number
    lfs APIs
        llapi_file_get_stripe
        llapi_file_open
Future Developments
Glossary

Overview

In a Lustre file system, metadata describing where data is stored on object storage servers (OSTs) is defined in extended attributes (EAs) on the metadata server (MDS). This information, called the striping EA, is described in detail in this white paper. Also described is a set of APIs provided with Lustre that allow modules and applications to manipulate the striping EA.

Data Striping in a Lustre File System

In a Lustre file system, metadata and data are stored separately: metadata on the metadata server (MDS) and data on the object storage servers (OSTs).

Striping Extended Attributes

When accessing a file, the client obtains data location information from the MDS. The location information indicates how the file is striped across the OSTs. Since this information is stored in the extended attributes (EA) of each inode on the MDS, it is called the striping EA.

The status of the striping EA may be in-disk, in-memory (kernel mode inside Lustre), or in-application (striping EA in a user-level application). Each status corresponds to a different format. Lustre provides a set of APIs that other modules or applications can use to manipulate a striping EA.

Below are a few examples showing how the striping EA is used by other Lustre modules.

    Use Case ID     Quality Attribute   Summary
    create-file     usability           A client creates a file.
    unlink-file     usability           A client unlinks a file.
    lfs-setstripe   usability           A client creates a file with a specified striping EA.
    MPI-LIB         usability           The MPI opens or creates a file with a specified striping EA.
    copy-file       usability           Copy files from Lustre to another filesystem (QFS, pNFS or GPFS), while retaining the same striping information.

Quality Attribute Scenarios

create-file

    Scenario: Client creates a new file.
    Business goals: Ensure that the basic POSIX function works.
    Relevant QAs: Usability
    Details
        Stimulus: Create a file
        Stimulus source: Client application
        Environment: Lustre-mounted client
        Striping API usages: The client sends a create request to the MDS. The MDS calls the striping API to distribute the create request to the OSTs to create the data objects. The striping information is then returned to the MDS. The MDS calls the striping API again to convert the striping information to the appropriate disk format and places it into the EA of the metadata object.

unlink-file

    Scenario: Client unlinks a file.
    Business goals: Ensure that the basic POSIX function works.
    Relevant QAs: Usability
    Details
        Stimulus: Unlink a file
        Stimulus source: Client application
        Environment: Lustre-mounted client
        Striping API usages: A client sends an unlink request to the MDS. The MDS unlinks the metadata object and logs the action in the unlink log. The client then calls the striping API to locate the object on the OST and sends the unlink request to the OST. After the data objects on the OST are removed, the callback mechanism tells the MDS to remove the unlink log.

lfs-setstripe

    Scenario: Client opens/creates a file with a specified striping EA.
    Business goals: Tune striping to meet user requirements.
    Relevant QAs: Usability
    Details
        Stimulus: Execute lfs setstripe.
        Stimulus source: lfs setstripe and lfs getstripe utilities. Lustre also provides several lfs utilities to end users to set or get the striping information for a regular file or directory.
        Environment: Lustre-mounted client
        Striping API usages: In the current Lustre release, the striping EA of a regular file can only be set when it is opened or written for the first time. Executing lfs setstripe therefore implies opening or creating the file with a specific striping EA. In the stripe-setting process, lfs first transfers the defined striping EA to the file system (Lustre client), then the Lustre client sends the open/create request with the striping EA to the MDS. The MDS calls the striping API to locate the OSTs according to the striping EA specification and creates the objects on these OSTs. Then the MDS calls the striping API again to set the striping EA on the metadata object.

Note: Limits for stripe settings are:
    Maximum striping count for a single file is 160.
    Maximum striping count for the system is 65532.
    Minimum striping size is 65536.
    The result of stripe_size * stripe_count must be less than 0xffffffff.
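The limits in the note above can be checked before a striping request is issued. The following is a minimal sketch, not Lustre code: the function and macro names are hypothetical, and only the per-file limits quoted in the note are encoded.

```c
#include <stdbool.h>
#include <stdint.h>

/* Per-file limits quoted in the note above (macro names are hypothetical). */
#define MAX_STRIPE_COUNT_PER_FILE 160u
#define MIN_STRIPE_SIZE           65536u
#define MAX_STRIPE_WIDTH          0xffffffffULL

/* Hypothetical helper: true when the requested layout respects the limits. */
bool stripe_settings_valid(uint32_t stripe_size, uint32_t stripe_count)
{
    if (stripe_count > MAX_STRIPE_COUNT_PER_FILE)
        return false;
    if (stripe_size < MIN_STRIPE_SIZE)
        return false;
    /* Widen to 64 bits before multiplying so the product cannot wrap. */
    return (uint64_t)stripe_size * stripe_count < MAX_STRIPE_WIDTH;
}
```

For instance, a 1 MB stripe size with a stripe count of 2 passes, while a 4096-byte stripe size or a stripe count of 161 is rejected.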

MPI-LIB

    Scenario: Client opens/creates a file with a specified striping EA in MPI-LIB.
    Business goals: Enable MPI-LIB (Lustre ADIO driver) to execute lfs-setstripe directly.
    Relevant QAs: Usability
    Details
        Stimulus: Use MPI_File_open with stripe hints to open or create a file
        Stimulus source: MPI-LIB + Lustre ADIO driver
        Environment: Lustre-mounted client and MPI environment
        Striping API usages: The MPI uses the striping API only in MPI_File_open (in the Lustre ADIO driver), where it may be necessary to open/create a file with a certain striping EA. The MPI programmer can set the striping EA using a hint. Below is an example showing how IOR is used to set a striping EA.

    IOR_HINT__MPI__striping_unit=1048576    #striping size is 1M
    IOR_HINT__MPI__striping_factor=2        #striping count is 2
    IOR_HINT__MPI__striping_iodevice=0      #striping offset (index) is 0

The setting process is almost the same as for lfs-setstripe, with one difference. In MPI, the ioctl system call is used directly to set the striping EA, instead of an API from the Lustre user API library, to avoid linking an unnecessary library when building the MPI + Lustre ADIO driver.

copy-file

    Scenario: Copy files from Lustre to another filesystem (QFS, pNFS or GPFS).
    Business goals: Copying files between Lustre and other filesystems (QFS, pNFS and GPFS), while retaining striping information without manual user intervention.
    Relevant QAs: Usability
    Details
        Stimulus: Copy files from a Lustre file system to another file system (QFS, pNFS and GPFS) while keeping the same striping pattern.
        Stimulus source: Copy filesystem tool (modified star) is used to specify user-level Lustre striping.
        Environment: Lustre filesystem. Striping information for the Lustre and QFS filesystems is similar enough that the user-level tool (modified star) can convert one to the other.
        Striping API usages: Lustre provides a patch to the star backup tool to allow star to restore the complete Lustre file system with the same striping pattern as before. star can also be used in the copy process. For example, when file A is copied, star first calls the Lustre user-level striping API to extract the striping EA of file A from the MDS (in-application format). Then star starts to copy file A to the other file system (e.g. QFS). star creates a file on the target file system (possibly by using mknod) and sets the striping EA on that file. Since the striping formats for these two file systems are very similar, star should not change the striping EA or should make only minor modifications. Finally, star copies file A to the target file system according to the defined striping EA format.

Striping Format

The striping EA status designates three striping EA formats:

    In-disk format (lov_mds_md): Used when the striping EA is stored on disk.
    In-memory format (lov_stripe_md): Used when the striping EA is being read out from the disk and unpacked.
    User format (lov_user_md): Used when the striping EA is retrieved by the application and ready to output to the end user.

Independent of the format, all striping EAs consist primarily of two parts:

    Public: Applies to all the OSTs on which the file is located. Indicates how the file is striped over the OSTs.
    Private: An array in which each array item corresponds to one OST. Each array item specifies the OST index and the data object ID within it.

When mapping a file offset to the corresponding offset in an OST object, Lustre computes the OST array index from the file offset, striping size and striping count. It then consults the private OST array to obtain the OST index and object ID.
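The lookup described above can be sketched for the RAID-0 pattern as follows. This is an illustration under stated assumptions, not the Lustre implementation: struct ost_entry is a simplified stand-in for one private array item, and both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for one entry of the private (per-OST) array. */
struct ost_entry {
    uint64_t object_id;  /* data object ID on that OST */
    uint32_t ost_idx;    /* OST index */
};

/* RAID-0 places stripe units round-robin, so the array slot is the
 * stripe unit number modulo the stripe count. */
uint32_t stripe_array_index(uint64_t file_off, uint32_t stripe_size,
                            uint32_t stripe_count)
{
    assert(stripe_size > 0 && stripe_count > 0);
    return (uint32_t)((file_off / stripe_size) % stripe_count);
}

/* Returns the private array entry that holds the byte at file_off. */
const struct ost_entry *lookup_object(const struct ost_entry *objects,
                                      uint64_t file_off,
                                      uint32_t stripe_size,
                                      uint32_t stripe_count)
{
    return &objects[stripe_array_index(file_off, stripe_size, stripe_count)];
}
```

With a 1 MB stripe size and three stripes, byte offset 5 MB falls in stripe unit 5, which maps to array slot 2.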

Striping Disk Format

Two striping disk formats are available: normal striping format for a normal file and joined striping format for a joined file.

Normal Striping EA Formats

The two parts of the normal striping EA, lov_mds_md (public) and lov_ost_data (OST private), are described below.

    struct lov_mds_md {
        /* LOV_MDS_MD */
        u32 lmm_magic;
        u32 lmm_pattern;
        u64 lmm_object_id;
        u64 lmm_object_gr;
        u32 lmm_stripe_size;
        u32 lmm_stripe_count;
        /* LOV_OST_DATA */
        struct lov_ost_data lmm_objects[0];
    };

    ID               Description
    LOV_MDS_MD       Striping information
    LOV_OST_DATA[]   Location information for the objects. Each OST for this object corresponds to an entry in the array.

LOV_MDS_MD

    Name               Size      Description
    lmm_magic          32 bits   Normal file (0x0BD10BD0)
    lmm_pattern        32 bits   Stripe pattern: RAID-0, RAID-1 or other network striping pattern. Only the RAID-0 pattern is currently supported.
    lmm_object_id      64 bits   Object ID on the MDS, which is the ino of the object (inode) in the MDS.
    lmm_object_gr      64 bits   For a directory, the object group number is used to determine if the striping EA for the directory is the default striping EA or a striping EA specified by lfs setstripe. For a file, the object group number is currently unused, but, in future releases, it will be used to identify groups of objects in a cluster metadata (CMD) environment.
    lmm_stripe_size    32 bits   Stripe size: Number of bytes stored on each OST before moving to the next OST.
    lmm_stripe_count   32 bits   Stripe count: Number of stripes in the file.

LOV_OST_DATA

    struct lov_ost_data_v1 {
        u64 l_object_id;
        u64 l_object_gr;
        u32 l_ost_gen;
        u32 l_ost_idx;
    };

    Name          Size      Description
    l_object_id   64 bits   Object ID on the OST
    l_object_gr   64 bits   Object group number (same as lmm_object_gr in LOV_MDS_MD)
    l_ost_gen     32 bits   Generation of l_ost_idx.
    l_ost_idx     32 bits   OST index in the logical object volume (LOV) on the MDS, which is handled by the management server (MGS) in the current version of Lustre.

Joined Striping EA Format

A joined file is made up of several normal files, each with its own extent and corresponding striping EA.

Joined File Stripe Format

For a joined file, the striping disk formats include:

    Joined striping information (LOV_MDS_JOINED_MD)
    Striping extent information (MDS_EXTENT_DESCRIPTION). This information is stored in the log file for which the llog_logid is defined in the joined striping EA.

    struct lov_mds_md_join {
        /* LOV_MDS_JOINED_MD */
        struct lov_mds_md lmmj_md;
        /* MDS_EXTENT_DESCRIPTION */
        struct llog_logid lmmj_array_id;
        u32 lmmj_extent_count;
    };

    ID                  Field               Description
    LOV_MDS_JOINED_MD   lmmj_md             Striping information. The format is the same as for LOV_MDS_MD.
                        lmmj_extent_count   The number of normal files in the joined file.
    JOINED_LOG_ID       lmmj_array_id       ID for the log file containing the striping extent information.

LOV_MDS_JOINED_MD

    Name                Size      Description
    lmm_magic           32 bits   Joined file (0x0BD20BD0).
    lmm_pattern         32 bits   Stripe pattern. For a joined file, each file should have the same pattern in the current version of Lustre.
    lmm_object_id       64 bits   Object ID on the MDS, which is the ino of the object (inode) in the MDS.
    lmm_object_gr       64 bits   For a directory, the object group number is used to determine if the striping EA for the directory is the default striping EA or a striping EA specified by lfs setstripe. For a file, the object group number is currently unused, but, in future releases, it will be used to identify groups of objects in a cluster metadata (CMD) environment.
    lmm_stripe_count    32 bits   Total stripe count of each normal file in the joined file.
    lmm_stripe_size     32 bits   Not used currently.
    lmmj_extent_count   32 bits   Number of normal files in the joined file.

MDS_EXTENT_DESCRIPTION

For each joined file, extent striping information is stored in a log file, which is referred to by llog_logid.

    struct llog_logid {
        u64 lgl_oid;
        u64 lgl_ogr;
        u32 lgl_ogen;
    };

JOINED_LOG_ID

    Name       Size      Description
    lgl_oid    64 bits   Log ID of the object
    lgl_ogr    64 bits   Log group of the object
    lgl_ogen   32 bits   Log generation of the object
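Because the two disk formats differ in lmm_magic, a reader of the EA can branch on that field before interpreting the rest. A minimal sketch, assuming the magic has already been converted to host byte order; the macro, enum and function names are chosen for this illustration and are not taken from the Lustre headers.

```c
#include <stdint.h>

/* Magic values from the LOV_MDS_MD and LOV_MDS_JOINED_MD tables. */
#define EA_MAGIC_NORMAL 0x0BD10BD0u
#define EA_MAGIC_JOINED 0x0BD20BD0u

enum ea_kind { EA_NORMAL, EA_JOINED, EA_UNKNOWN };

/* Hypothetical helper: classify a striping EA by its lmm_magic field. */
enum ea_kind classify_striping_ea(uint32_t lmm_magic)
{
    switch (lmm_magic) {
    case EA_MAGIC_NORMAL:
        return EA_NORMAL;
    case EA_MAGIC_JOINED:
        return EA_JOINED;
    default:
        return EA_UNKNOWN;
    }
}
```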

JOINED File LOG Formats

The joined log file is composed of joined log records. Each joined record includes a log header, a joined_record and a log tail.

    struct mds_extent_desc {
        u64 med_start;
        u64 med_len;
        struct lov_mds_md med_lmm;
    };

    struct llog_rec_hdr {
        u32 lrh_len;
        u32 lrh_index;
        u32 lrh_type;
        u32 padding;
    };

    struct llog_rec_tail {
        u32 lrt_len;
        u32 lrt_index;
    };

    struct llog_array_rec {
        struct llog_rec_hdr lmr_hdr;
        struct mds_extent_desc lmr_med;
        struct llog_rec_tail lmr_tail;
    };
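Each joined record carries one mds_extent_desc, so resolving an offset in a joined file amounts to finding the extent that covers it. The sketch below assumes the extents have already been read from the log into an array; struct extent_desc is a reduced form of mds_extent_desc (the embedded lov_mds_md is omitted) and find_extent is a hypothetical helper.

```c
#include <stddef.h>
#include <stdint.h>

/* Reduced form of mds_extent_desc: the embedded lov_mds_md is omitted. */
struct extent_desc {
    uint64_t med_start;  /* offset of the extent in the joined file */
    uint64_t med_len;    /* length of the extent */
};

/* Returns the index of the extent containing file_off, or -1 if none does. */
int find_extent(const struct extent_desc *extents, size_t count,
                uint64_t file_off)
{
    for (size_t i = 0; i < count; i++) {
        /* Subtract first so that med_start + med_len cannot overflow. */
        if (file_off >= extents[i].med_start &&
            file_off - extents[i].med_start < extents[i].med_len)
            return (int)i;
    }
    return -1;
}
```

Once the extent is known, the med_lmm stored with it supplies the striping information of the normal file that backs that range.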

    Part            Name        Size                 Description
    log_header      lrh_len     32 bits              Log record length
                    lrh_index   32 bits              Log record index
                    lrh_type    32 bits              Log record type
                    padding     32 bits              Record padding for 4-byte alignment
    joined record   med_start   64 bits              Offset of the extent for the normal file in the joined file
                    med_len     64 bits              Length of the extent for the normal file in the joined file
                    med_lmm     size of LOV_MDS_MD   Striping information for each normal file (same as LOV_MDS_MD)
    log_tail        lrt_len     32 bits              Log record length. The value is the same as for lrh_len.
                    lrt_index   32 bits              Log record index. The value is the same as for lrh_index.

Striping Memory Format

The in-memory striping EA also includes general striping information and private information for each OST.

    struct lov_oinfo {
        u64 loi_id;
        u64 loi_gr;
        int loi_ost_idx;
        int loi_ost_gen;
        /* used by the osc to keep track of what objects to build into rpcs */
        struct loi_oap_pages loi_read_lop;
        struct loi_oap_pages loi_write_lop;
        /* _cli_ is poorly named, it should be _ready_ */
        struct list_head loi_cli_item;
        struct list_head loi_write_item;
        struct list_head loi_read_item;
        unsigned loi_kms_valid:1;
        u64 loi_kms;
        struct ost_lvb loi_lvb;
        struct osc_async_rc loi_ar;
    };

    struct lov_stripe_md {
        /* General striping information */
        spinlock_t lsm_lock;
        void *lsm_lock_owner;
        struct {
            u64 lw_object_id;
            u64 lw_object_gr;
            u64 lw_maxbytes;
            u32 lw_magic;
            u32 lw_stripe_size;
            u32 lw_pattern;
            unsigned lw_stripe_count;
        } lsm_wire;
        /* Private OST array */
        struct lov_array_info *lsm_array;
        struct lov_oinfo *lsm_oinfo[0];
    };

    Part                       Name             Size                          Description
    lsm_lock                   lsm_lock         size of spinlock_t            Lock to protect each item of the striping EA.
                               lsm_lock_owner   size of void*                 Owner of the lsm_lock, for debugging purposes
    lsm striping information   lw_object_id     64 bits                       lov object ID (same as lmm_object_id)
                               lw_object_gr     64 bits                       lov object group number (same as lmm_object_gr)
                               lw_maxbytes      64 bits                       Maximum possible file size
                               lw_magic         32 bits                       lsm magic number (same as lmm_magic)
                               lw_stripe_size   32 bits                       Size of the stripe (same as lmm_stripe_size)
                               lw_pattern       32 bits                       Pattern of the stripe (same as lmm_pattern)
    OST array information      lsm_array        size of pointer               Pointer to a lsm array, only for a joined file
                               loi_id           64 bits                       Data object ID (same as l_object_id)
                               loi_gr           64 bits                       Data object group (same as l_object_gr)
                               loi_ost_idx      size of int                   OST index of the data object
                               loi_ost_gen      size of int                   OST generation of the data object
                               loi_read_lop     size of struct loi_oap_pages  List of pending read pages for the file for this object server client (OSC).
                               loi_write_lop    size of struct loi_oap_pages  List of pending write pages for the file for this OSC.
                               loi_cli_item     size of struct list_head      List of objects ready to read/write for this OSC.
                               loi_read_item    size of struct list_head      List of objects to be read for this OSC.
                               loi_write_item   size of struct list_head      List of objects to be written for this OSC.
                               loi_kms          64 bits                       Known minimum size of the data object
                               loi_kms_valid    1 bit (unsigned bit-field)    Valid flag for known minimum size
                               loi_lvb          size of struct ost_lvb        Lock value block. Used to capture data object status information (size, time, etc.) communicated between the filter and OSC. The Lustre client system (llite) and LOV (llite/lov) merge the acquired information into a complete set of information about the file.
                               loi_ar           size of struct osc_async_rc   Used to propagate asynchronous writeback errors back up to the application. If an asynchronous write fails, an error code is recorded and used later when an application executes an fsync operation.

Striping User Format

The striping user format is used when the striping EA is retrieved by a user-level application (for example, with lfs getstripe/setstripe).

    struct lov_user_ost_data_v1 {
        u64 l_object_id;
        u64 l_object_gr;
        u32 l_ost_gen;
        u32 l_ost_idx;
    };

    struct lov_user_md {
        u32 lmm_magic;
        u32 lmm_pattern;
        u64 lmm_object_id;
        u64 lmm_object_gr;
        u32 lmm_stripe_size;
        u16 lmm_stripe_count;
        u16 lmm_stripe_offset;
        struct lov_user_ost_data_v1 lmm_objects[0];
    };
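Since lov_user_md ends in a variable-length array of per-OST entries, a caller has to size its buffer from the stripe count. The sketch below mirrors the two user-format structures with fixed-width C types and a C99 flexible array member in place of the [0] idiom; lov_user_md_size and lov_user_md_alloc are hypothetical helpers, not part of the Lustre user API.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct lov_user_ost_data_v1 {
    uint64_t l_object_id;
    uint64_t l_object_gr;
    uint32_t l_ost_gen;
    uint32_t l_ost_idx;
};

struct lov_user_md {
    uint32_t lmm_magic;
    uint32_t lmm_pattern;
    uint64_t lmm_object_id;
    uint64_t lmm_object_gr;
    uint32_t lmm_stripe_size;
    uint16_t lmm_stripe_count;
    uint16_t lmm_stripe_offset;
    struct lov_user_ost_data_v1 lmm_objects[];  /* one entry per stripe */
};

/* Bytes needed for a user-format EA describing stripe_count objects. */
size_t lov_user_md_size(uint16_t stripe_count)
{
    return sizeof(struct lov_user_md) +
           (size_t)stripe_count * sizeof(struct lov_user_ost_data_v1);
}

/* Allocates a zeroed buffer of the right size; the caller frees it. */
struct lov_user_md *lov_user_md_alloc(uint16_t stripe_count)
{
    struct lov_user_md *lum = calloc(1, lov_user_md_size(stripe_count));
    if (lum != NULL)
        lum->lmm_stripe_count = stripe_count;
    return lum;
}
```

With these field widths the fixed part occupies 32 bytes and each per-OST entry 24 bytes, so a two-stripe file needs an 80-byte buffer.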

The user format differs from the in-disk format in the following ways:

    The user format has an lmm_stripe_offset field, which the in-disk format does not have. lmm_stripe_offset is used by setstripe to transfer the striping_index parameter to Lustre when setting a stripe.
    In the user format, lmm_stripe_count has only 16 bits, while in the in-disk format it has 32 bits. So in the current Lustre release, the maximum stripe count is 65532.

Striping APIs

Lustre provides a set of APIs to handle the striping EAs. The five types of APIs are listed below according to their functionality:

    Set/get APIs. Used to set or get a striping EA to or from storage.
    Pack/unpack APIs. Because striping EAs are stored in packed format on disk, pack/unpack APIs are provided to pack and unpack striping EAs after a get or set striping EA API is used.
    Allocate/free APIs. Used to allocate and free striping EAs in memory.
    Striping location APIs. Since location information for data objects is stored in striping EAs, APIs are provided to access the striping EAs and return data object location information. These APIs are also used to select the OST where the data object is to be created.
    lfs APIs. User-level APIs used by applications (lfs utilities) to handle striping EAs.

The set/get APIs operate on striping EAs in in-disk format. The pack/unpack APIs operate on striping EAs in both in-disk and in-memory formats. The other APIs operate on striping EAs in in-memory format.

Get/Set Striping EA APIs

fsfilt_set/get_md

    int fsfilt_set_md(struct obd_device *obd, struct inode *inode, void *handle, void *md, int size, const char *name)
    int fsfilt_get_md(struct obd_device *obd, struct inode *inode, void *md, int size, const char *name)

    obd      Device of the object
    inode    MDS object
    handle   Journal handle for setting a striping EA
    md       Buffer of the striping EA
    size     Size of the striping EA
    name     Name (LOV) of the striping EA

Return values:

    fsfilt_set_md   0 means success. A negative error number means an error.
    fsfilt_get_md   0 means success. A positive return value is the number of bytes that need to be added to the buffer to make it large enough to contain the striping EA. A negative error number means an error. Note: If the striping EA does not exist, get_md still returns 0.

These two APIs are used by the MDS to get or set a striping EA.

Pack/Unpack Striping EA APIs

obd_packmd

    int obd_packmd(struct obd_export *exp, struct lov_mds_md **disk_tgt, struct lov_stripe_md *mem_src)

    exp        Export of the device
    disk_tgt   Disk structure for the striping EA
    mem_src    In-memory structure for the striping EA

Behavior:

    If disk_tgt is NULL, the size of the striping EA described by the in-memory structure *mem_src is returned.
    If both disk_tgt and mem_src are NULL, the maximum possible striping EA size is returned.
    If disk_tgt is not NULL and mem_src is NULL, *disk_tgt is freed.
    If *disk_tgt is NULL, an in-disk structure is allocated.

This API packs the striping EA from an in-memory format to an in-disk description.
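The way obd_packmd overloads its two pointer arguments can be summarized as a decision table. The sketch below only classifies a call according to the rules listed above; it performs no packing. The enum, the function name and the placeholder structure bodies are hypothetical.

```c
#include <stddef.h>

/* Placeholder bodies; the real structures are described in earlier sections. */
struct lov_mds_md    { int placeholder; };
struct lov_stripe_md { int placeholder; };

enum packmd_action {
    PACKMD_SIZE_OF_SRC,     /* disk_tgt == NULL, mem_src != NULL */
    PACKMD_MAX_SIZE,        /* disk_tgt == NULL, mem_src == NULL */
    PACKMD_FREE_TARGET,     /* disk_tgt != NULL, mem_src == NULL */
    PACKMD_ALLOC_AND_PACK,  /* *disk_tgt == NULL: allocate, then pack */
    PACKMD_PACK_IN_PLACE    /* pack into the caller's existing buffer */
};

/* Hypothetical helper: which of the documented behaviors a call selects. */
enum packmd_action packmd_action_for(struct lov_mds_md **disk_tgt,
                                     const struct lov_stripe_md *mem_src)
{
    if (disk_tgt == NULL)
        return mem_src != NULL ? PACKMD_SIZE_OF_SRC : PACKMD_MAX_SIZE;
    if (mem_src == NULL)
        return PACKMD_FREE_TARGET;
    if (*disk_tgt == NULL)
        return PACKMD_ALLOC_AND_PACK;
    return PACKMD_PACK_IN_PLACE;
}
```

The last case, packing into a buffer the caller already owns, is implied rather than stated by the rules above and is an assumption of this sketch.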

obd_unpackmd

    int obd_unpackmd(struct obd_export *exp, struct lov_stripe_md **mem_tgt, struct lov_mds_md *disk_src, int disk_len)

    exp        Export of the device
    mem_tgt    In-memory structure for the striping EA
    disk_src   Disk structure for the striping EA
    disk_len   Length of disk_src

Return values: A positive value indicates the size of the unpacked striping EA. 0 is returned when the API tries to free the disk_src. A negative value indicates an error.

This API unpacks the striping EA from an in-disk format (disk_src) to an in-memory description (mem_tgt). When mem_tgt is NULL, the API will free disk_src.

Allocation/Free

obd_size_diskmd

    int obd_size_diskmd(struct obd_export *exp, struct lov_stripe_md *mem_src)

    exp       Export of the device.
    mem_src   In-memory structure for the striping EA.

Return values: If mem_src is not NULL, the size of the striping EA pointed to by mem_src is returned. If mem_src is NULL, the maximum striping EA size is returned.

This API returns the real size of the striping EA.

obd_alloc_diskmd

    int obd_alloc_diskmd(struct obd_export *exp, struct lov_mds_md **disk_tgt)

    exp        Export of the device
    disk_tgt   Allocated in-disk-formatted striping EA.

Return values: 0 means success. A negative number means an error.

This API returns the in-disk-formatted striping EA pointed to by disk_tgt. It allocates the maximum striping EA size, which typically equals the maximum data object count of the file * size of struct lov_ost.

obd_free_diskmd

    int obd_free_diskmd(struct obd_export *exp, struct lov_mds_md **disk_tgt)

    exp        Export of the device
    disk_tgt   In-disk-formatted striping EA memory to be freed

Return values: 0 means success. A negative number means an error.

This API frees the in-disk-formatted striping EA referenced by *disk_tgt.

obd_alloc_memmd

    int obd_alloc_memmd(struct obd_export *exp, struct lov_stripe_md **mem_tgt)

    exp       Export of the device
    mem_tgt   Allocated in-memory-formatted striping EA

Return values: 0 means success. A negative number means an error.

This API returns the in-memory-formatted striping EA pointed to by mem_tgt. It allocates the maximum striping EA size.

obd_free_memmd

    int obd_free_memmd(struct obd_export *exp, struct lov_stripe_md **mem_tgt)

    exp       Export of the device
    mem_tgt   In-memory-formatted striping EA memory to be freed

Return values: 0 means success. A negative number means an error.

This API frees the in-memory-formatted striping EA referenced by *mem_tgt.

Striping Location APIs

lov_stripe_size

    obd_size lov_stripe_size(struct lov_stripe_md *lsm, obd_size ost_size, int stripeno)

    lsm        In-memory striping EA
    ost_size   Size of a single data object in an OST.
    stripeno   Stripe number of the data object

Return values: The computed file size.

This API computes the file size given stripeno and the OST size, where stripeno and the OST size are associated with the OST where the end of the file is located.
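The arithmetic behind such a computation can be worked out from the RAID-0 layout alone. The following sketch is derived from the round-robin placement of stripe units and is not the Lustre implementation; the function name and the flat parameter list (stripe size and count passed directly instead of through lov_stripe_md) are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* File size implied by a data object of ost_size bytes at stripe stripeno,
 * assuming that object holds the end of the file (RAID-0 layout). */
uint64_t file_size_from_object(uint32_t stripe_size, uint32_t stripe_count,
                               uint64_t ost_size, uint32_t stripeno)
{
    assert(stripe_size > 0 && stripe_count > 0);
    if (ost_size == 0)
        return 0;

    uint64_t width = (uint64_t)stripe_size * stripe_count; /* one full row */
    uint64_t full  = ost_size / stripe_size;  /* complete units in object */
    uint64_t tail  = ost_size % stripe_size;  /* bytes in a partial unit */

    if (tail != 0)
        /* full rows, then the units before this stripe, then the tail */
        return full * width + (uint64_t)stripeno * stripe_size + tail;

    /* The object ends on a unit boundary: its last unit is complete. */
    return (full - 1) * width + ((uint64_t)stripeno + 1) * stripe_size;
}
```

For a 64 KB stripe size and three stripes, an object of 1000 bytes on stripe 2 implies a file of 2 * 65536 + 1000 = 132072 bytes.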

lov_stripe_offset

    int lov_stripe_offset(struct lov_stripe_md *lsm, obd_off lov_off, int stripeno, obd_off *obd_off)

    lsm        In-memory striping EA
    lov_off    Logical file offset
    stripeno   Stripe number of the data object
    obd_off    Offset in the OST indicated by stripeno that is nearest to the logical file offset (lov_off).

Return values:

    0    The OST indicated by stripeno is exactly the same OST as the one indicated by the offset (lov_off).
    -1   The index of the OST indicated by stripeno is less than the index of the OST indicated by the offset (lov_off).
    1    The index of the OST indicated by stripeno is larger than the index of the OST indicated by the offset (lov_off).

This API is used to check whether an extent intersects with an OST.

lov_stripe_number

    int lov_stripe_number(struct lov_stripe_md *lsm, obd_off lov_off)

    lsm       In-memory striping EA
    lov_off   Logical file offset

Return values: The stripe number to which lov_off belongs.

This API computes which stripe number lov_off belongs to.

lfs APIs

llapi_file_get_stripe

    int llapi_file_get_stripe(const char *path, struct lov_user_md *lum)

    path   Path of the file
    lum    Striping information returned to the caller

Return values: 0 means success. A negative number means an error.

This API returns striping information to the caller to be used by the application.

llapi_file_open

    int llapi_file_open(const char *name, int flags, int mode, unsigned long stripe_size, int stripe_offset, int stripe_count, int stripe_pattern)

    name             Filename
    flags            Open flags
    mode             Open mode
    stripe_size      Stripe size of the file
    stripe_offset    Stripe offset (stripe_index) of the file
    stripe_count     Stripe count of the file
    stripe_pattern   Stripe pattern of the file

Return values: 0 means success. A negative number means an error.

This API opens/creates a file with specified striping parameters.

Future Developments

With the currently implemented striping disk format, ->obd_unpackmd() must have an end-to-end understanding of all possible combinations of layouts, i.e., the format is basically flat rather than hierarchical. To facilitate development of new layouts, the striping disk format will be adjusted so that higher layers (e.g., struct lov_mds_md) can be parsed without knowing the details of the lower layer (in this case, struct lov_ost_data) representation. A straightforward way to do this is to precede each layout descriptor with the standard header:

    struct md_layout_descriptor_header {
        u16 mldh_magic;
        u16 mldh_length;
    };

where ->mldh_magic identifies the layout type and is used to determine the ->obd_unpackmd() method to be called to parse the descriptor, and ->mldh_length is the total descriptor length, which is used by the upper layer to pass over lower layer descriptors without understanding details of their representation. Care must be taken, however, to avoid introducing too much redundant information to the on-disk EA for the most common uses.
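The skipping behavior that mldh_length enables can be sketched as a simple walk over a buffer. This is an illustration only: it assumes descriptors are stored back to back in host byte order, which the proposal above does not specify, and count_layouts is a hypothetical function.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct md_layout_descriptor_header {
    uint16_t mldh_magic;   /* identifies the layout type */
    uint16_t mldh_length;  /* total descriptor length, header included */
};

/* Counts the descriptors whose magic matches, stepping over all others by
 * length alone. Returns -1 if the buffer is malformed. */
int count_layouts(const unsigned char *buf, size_t len, uint16_t magic)
{
    size_t pos = 0;
    int found = 0;

    while (pos < len) {
        struct md_layout_descriptor_header hdr;

        if (len - pos < sizeof(hdr))
            return -1;                      /* truncated header */
        memcpy(&hdr, buf + pos, sizeof(hdr)); /* avoids unaligned access */
        if (hdr.mldh_length < sizeof(hdr) || hdr.mldh_length > len - pos)
            return -1;                      /* length out of range */
        if (hdr.mldh_magic == magic)
            found++;
        pos += hdr.mldh_length;             /* skip the body unparsed */
    }
    return found;
}
```

The walker never inspects a descriptor body, which is the property the header is meant to provide: a layer can pass over layouts it does not understand.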

Glossary

    ADIO    Abstract-device interface for I/O. The ADIO driver is an abstract-device interface for parallel I/O that is used by the MPI to implement its I/O library.
    CMD     Cluster metadata
    EA      Extended attribute
    llite   Lustre client system
    LOV     Logical object volume
    MDS     Metadata server
    MGS     Management server
    MPI     Message Passing Interface
    OSC     Object server client
    OST     Object storage server