Managing Lustre™ Data Striping


Metadata Server Extended Attributes and Lustre Striping APIs in a Lustre File System

Sun Microsystems, Inc.
February 4, 2008

Table of Contents

Overview
    Data Striping in a Lustre File System
    Striping Extended Attributes
Quality Attribute Scenarios
Striping Format
    Striping Disk Format
        Normal Striping EA Formats
            LOV_MDS_MD
            LOV_OST_DATA
        Joined Striping EA Format
            Joined File Stripe Format
            LOV_MDS_JOINED_MD
            MDS_EXTENT_DESCRIPTION
            JOINED File LOG Formats
    Striping Memory Format
    Striping User Format
Striping APIs
    Get/Set Striping EA APIs
        fsfilt_set/get_md
    Pack/Unpack Striping EA APIs
        obd_packmd
        obd_unpackmd
    Allocation/Free
        obd_size_diskmd
        obd_alloc_diskmd
        obd_free_diskmd
        obd_alloc_memmd
        obd_free_memmd
    Striping Location APIs
        lov_stripe_size
        lov_stripe_offset
        lov_stripe_number
    lfs APIs
        llapi_file_get_stripe
        llapi_file_open
Future Developments
Glossary

Overview

In a Lustre file system, metadata describing where data is stored on object storage servers (OSTs) is defined in extended attributes (EAs) on the metadata server (MDS). This information, called the striping EA, is described in detail in this white paper. Also described is a set of APIs provided with Lustre that allow modules and applications to manipulate the striping EA.

Data Striping in a Lustre File System

In a Lustre file system, metadata and data are stored separately: metadata on the metadata server (MDS) and data on the object storage servers (OSTs).

Striping Extended Attributes

When accessing a file, the client obtains data location information from the MDS. The location information indicates how the file is striped across the OSTs. Since this information is stored in the extended attributes (EA) of each inode on the MDS, it is called the striping EA.

The status of the striping EA may be in-disk, in-memory (kernel mode inside Lustre), or in-application (striping EA in a user-level application). Each status corresponds to a different format. Lustre provides a set of APIs that other modules or applications can use to manipulate a striping EA.

Below are a few examples showing how the striping EA is used by other Lustre modules.

Use case ID (quality attribute): summary
    create-file (usability): A client creates a file.
    unlink-file (usability): A client unlinks a file.
    lfs-setstripe (usability): A client creates a file with a specified striping EA.
    MPI-LIB (usability): The MPI opens or creates a file with a specified striping EA.
    copy-file (usability): Copy files from Lustre to another filesystem (QFS, pnfs or GPFS), while retaining the same striping information.

Quality Attribute Scenarios

create-file

Scenario: Client creates a new file.
Business goals: Ensure that the basic POSIX function works.
Relevant QAs: Usability
Details:
    Stimulus: Create a file
    Stimulus source: Client application
    Environment: Lustre-mounted client
    Striping API usages: The client sends a create request to the MDS. The MDS calls the striping API to distribute the create request to the OSTs to create the data objects. The striping information is then returned to the MDS. The MDS calls the striping API again to convert the striping information to the appropriate disk format and places it into the EA of the metadata object.

unlink-file

Scenario: Client unlinks a file.
Business goals: Ensure that the basic POSIX function works.
Relevant QAs: Usability
Details:
    Stimulus: Unlink a file
    Stimulus source: Client application
    Environment: Lustre-mounted client
    Striping API usages: A client sends an unlink request to the MDS. The MDS unlinks the metadata object and logs the action in the unlink log. The client then calls the striping API to locate the object on the OST and sends the unlink request to the OST. After the data objects on the OST are removed, the callback mechanism tells the MDS to remove the unlink log.

lfs-setstripe

Scenario: Client opens/creates a file with a specified striping EA.
Business goals: Tune striping to meet user requirements.
Relevant QAs: Usability
Details:
    Stimulus: Execute lfs setstripe.
    Stimulus source: lfs setstripe and lfs getstripe utilities. Lustre also provides several lfs utilities to end users to set or get the striping information for a regular file or directory.
    Environment: Lustre-mounted client
    Striping API usages: In the current Lustre release, the striping EA of a regular file can only be set when the file is opened or written for the first time. So executing lfs setstripe implies opening or creating the file with a specific striping EA. In the stripe-setting process, lfs first transfers the defined striping EA to the file system (Lustre client), then the Lustre client sends the open/create request with the striping EA to the MDS. The MDS calls the striping API to locate the OSTs according to the striping EA specification and creates the objects on these OSTs. Then the MDS calls the striping API again to set the striping EA on the metadata object.

Note: Limits for stripe settings are:
- Maximum striping count for a single file is 160.
- Maximum striping count for the system is
- Minimum striping size is
- The result of stripe_size * stripe_count should be less than 0xffffffff.

MPI-LIB

Scenario: Client opens/creates a file with a specified striping EA in MPI-LIB.
Business goals: Enable MPI-LIB (Lustre ADIO driver) to execute lfs-setstripe directly.
Relevant QAs: Usability
Details:
    Stimulus: Use MPI_open/create with stripe hints to open or create a file
    Stimulus source: MPI-LIB + Lustre ADIO driver
    Environment: Lustre-mounted client and MPI environment
    Striping API usages: The MPI uses the striping API only in MPI_Open (in the Lustre ADIO driver), where it may be necessary to open/create a file with a certain striping EA. The MPI programmer can set the striping EA using a hint. Below is an example showing how IOR is used to set a striping EA.

        IOR_HINT__MPI__striping_unit=        # striping size is 1M
        IOR_HINT__MPI__striping_factor=2     # striping count is 2
        IOR_HINT__MPI__striping_iodevice=0   # striping offset (index) is 0

    The setting process is almost the same as for lfs-setstripe, with one difference. In MPI, the ioctl system call is used directly to set the striping EA, instead of an API from the Lustre user API library. This avoids linking an unnecessary library when building the MPI + Lustre ADIO driver.

copy-file

Scenario: Copy files from Lustre to another filesystem (QFS, pnfs or GPFS).
Business goals: Copying files between Lustre and other filesystems (QFS, pnfs and GPFS), while retaining striping information without manual user intervention.
Relevant QAs: Usability
Details:
    Stimulus: Copy files from a Lustre file system to another file system (QFS, pnfs and GPFS) while keeping the same striping pattern.
    Stimulus source: Copy filesystem tool (modified star) is used to specify user-level Lustre striping.
    Environment: Lustre filesystem. Striping information for the Lustre and QFS filesystems is similar enough that the user-level tool (modified star) can convert one to the other.
    Striping API usages: Lustre provides a patch to the star backup tool to allow star to restore the complete Lustre file system with the same striping pattern as before. star can also be used in the copy process. For example, when file A is copied, star first calls the Lustre user-level striping API to extract the striping EA of file A from the MDS (in-application format). Then star starts to copy file A to the other file system (e.g. QFS). star creates a file on the target file system (possibly by using mknod) and sets the striping EA on that file. Since the striping formats for these two file systems are very similar, star should not change the striping EA or should make only minor modifications. Finally, star copies file A to the target file system according to the defined striping EA format.

Striping Format

The striping EA status designates three striping EA formats:

- In-disk format (lov_mds_md): Used when the striping EA is stored on disk.
- In-memory format (lov_stripe_md): Used when the striping EA is being read out from the disk and unpacked.
- User format (lov_user_md): Used when the striping EA is retrieved by the application and ready to output to the end user.

Independent of the format, all striping EAs consist primarily of two parts:

- Public: Applies to all the OSTs on which the file is located. Indicates how the file is striped over the OSTs.
- Private: An array in which each array item corresponds to one OST. Each array item specifies the OST index and the data object ID within that OST.

When mapping a file offset to the specific offset within the OST object, Lustre computes the OST array index from the file offset, the striping size and the striping count. It then goes to the private OST array to obtain the OST index and object ID.
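The offset mapping described above can be sketched in C for the RAID-0 pattern. This is a minimal illustration, not Lustre code: the `sketch_` types and the function name are hypothetical, and the layout is assumed to be plain round-robin striping with one data object per array entry.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for one entry of the private OST array. */
struct sketch_ost_entry {
    uint32_t ost_idx;    /* OST index */
    uint64_t object_id;  /* data object ID on that OST */
};

/* Hypothetical stand-in for the public part plus the private array. */
struct sketch_layout {
    uint32_t stripe_size;   /* bytes stored on one OST before moving on */
    uint32_t stripe_count;  /* number of entries in objects[] */
    const struct sketch_ost_entry *objects;
};

struct sketch_location {
    uint32_t stripe_no;      /* index into the private OST array */
    uint32_t ost_idx;
    uint64_t object_id;
    uint64_t object_offset;  /* offset inside the OST object */
};

/* Map a file offset to (array index, OST, object, offset in object),
 * assuming round-robin (RAID-0) striping. */
static struct sketch_location
sketch_map_offset(const struct sketch_layout *lo, uint64_t file_off)
{
    assert(lo->stripe_size > 0 && lo->stripe_count > 0);

    uint64_t chunk = file_off / lo->stripe_size;    /* stripe-sized chunk number */
    uint64_t in_chunk = file_off % lo->stripe_size; /* offset inside that chunk */
    struct sketch_location loc;

    loc.stripe_no = (uint32_t)(chunk % lo->stripe_count);
    loc.object_offset = (chunk / lo->stripe_count) * lo->stripe_size + in_chunk;
    loc.ost_idx = lo->objects[loc.stripe_no].ost_idx;
    loc.object_id = lo->objects[loc.stripe_no].object_id;
    return loc;
}
```

With a 1 MB stripe size and three stripes, for example, file offset 4 MB + 10 falls in chunk 4, which lands on array entry 1 at object offset 1 MB + 10.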

Striping Disk Format

Two striping disk formats are available: normal striping format for a normal file and joined striping format for a joined file.

Normal Striping EA Formats

The two parts of the normal striping EA, lov_mds_md (public) and lov_ost_data (OST private), are described below.

    struct lov_mds_md {
            /* LOV_MDS_MD */
            u32 lmm_magic;
            u32 lmm_pattern;
            u64 lmm_object_id;
            u64 lmm_object_gr;
            u32 lmm_stripe_size;
            u32 lmm_stripe_count;
            /* LOV_OST_DATA */
            struct lov_ost_data lmm_objects[0];
    };

ID: description
    LOV_MDS_MD: Striping information.
    LOV_OST_DATA[]: Location information for the objects. Each OST for this object corresponds to an entry in the array.

LOV_MDS_MD

Name (size): description
    lmm_magic (32 bits): Normal file (0x0BD10BD0).
    lmm_pattern (32 bits): Stripe pattern: RAID-0, RAID-1 or other network striping pattern. Only the RAID-0 pattern is currently supported.
    lmm_object_id (64 bits): Object ID on the MDS, which is the ino of the object (inode) in the MDS.
    lmm_object_gr (64 bits): For a directory, the object group number is used to determine if the striping EA for the directory is the default striping EA or a striping EA specified by lfs setstripe. For a file, the object group number is currently unused, but, in future releases, it will be used to identify groups of objects in a cluster metadata (CMD) environment.
    lmm_stripe_size (32 bits): Stripe size: number of bytes stored on each OST before moving to the next OST.
    lmm_stripe_count (32 bits): Stripe count: number of stripes in the file.
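As a rough illustration of how the public header and the per-OST array fit together, the sketch below computes the size of a normal striping EA for a given stripe count and sanity-checks a buffer against it. The `sketch_` names are hypothetical, the fixed-width types stand in for u32/u64, the per-OST entry follows the lov_ost_data layout given in the LOV_OST_DATA section, and byte-order conversion of the on-disk fields is deliberately ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Magic for a normal file, as listed in the LOV_MDS_MD table. */
#define SKETCH_LOV_MAGIC_NORMAL 0x0BD10BD0u

/* Hypothetical mirror of lov_ost_data (OST private part). */
struct sketch_lov_ost_data {
    uint64_t l_object_id;
    uint64_t l_object_gr;
    uint32_t l_ost_gen;
    uint32_t l_ost_idx;
};

/* Hypothetical mirror of lov_mds_md (public part + trailing array). */
struct sketch_lov_mds_md {
    uint32_t lmm_magic;
    uint32_t lmm_pattern;
    uint64_t lmm_object_id;
    uint64_t lmm_object_gr;
    uint32_t lmm_stripe_size;
    uint32_t lmm_stripe_count;
    struct sketch_lov_ost_data lmm_objects[];
};

/* Bytes needed for a header followed by stripe_count entries. */
static size_t sketch_ea_size(uint32_t stripe_count)
{
    return sizeof(struct sketch_lov_mds_md) +
           (size_t)stripe_count * sizeof(struct sketch_lov_ost_data);
}

/* Returns 1 if buf_len bytes can hold a normal-file EA header and all
 * the entries it announces, 0 otherwise. Only the header is read. */
static int sketch_ea_is_consistent(const struct sketch_lov_mds_md *md,
                                   size_t buf_len)
{
    assert(md != NULL);
    if (buf_len < sizeof(*md))
        return 0;
    if (md->lmm_magic != SKETCH_LOV_MAGIC_NORMAL)
        return 0;
    return buf_len >= sketch_ea_size(md->lmm_stripe_count);
}
```

Under these assumptions the header occupies 32 bytes and each OST entry 24 bytes, so a two-stripe file needs an 80-byte EA.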

LOV_OST_DATA

    struct lov_ost_data_v1 {
            u64 l_object_id;
            u64 l_object_gr;
            u32 l_ost_gen;
            u32 l_ost_idx;
    };

Name (size): description
    l_object_id (64 bits): Object ID on the OST.
    l_object_gr (64 bits): Object group number (same as lmm_object_gr in LOV_MDS_MD_FORMAT_ID).
    l_ost_gen (32 bits): Generation of l_ost_idx.
    l_ost_idx (32 bits): OST index in the logical object volume (LOV) in the MDS server, which is handled by the management server (MGS) in the current version of Lustre.

Joined Striping EA Format

A joined file is made up of several normal files, each with its own extent and corresponding striping EA.

Joined File Stripe Format

For a joined file, the striping disk formats include:

- Joined striping information (LOV_MDS_JOINED_MD).
- Striping extent information (MDS_EXTENT_DESCRIPTION). This information is stored in the log file for which the llog_log_id is defined in the joined striping EA.

    struct lov_mds_md_join {
            /* LOV_MDS_JOINED_MD */
            struct lov_mds_md lmmj_md;
            /* MDS_EXTENT_DESCRIPTION */
            struct llog_logid lmmj_array_id;
            u32 lmmj_extent_count;
    };

ID: description
    LOV_MDS_JOINED_MD:
        lmmj_md: Striping information. The format is the same as for LOV_MDS_MD.
        lmmj_extent_count: The number of normal files in the joined file.
    JOINED_LOG_ID: ID for the log file containing the striping extent information.

LOV_MDS_JOINED_MD

Name (size): description
    lmm_magic (32 bits): Joined file (0x0BD20BD0).
    lmm_pattern (32 bits): Stripe pattern. For a joined file, each file should have the same pattern in the current version of Lustre.
    lmm_object_id (64 bits): Object ID on the MDS, which is the ino of the object (inode) in the MDS.
    lmm_object_gr (64 bits): For a directory, the object group number is used to determine if the striping EA for the directory is the default striping EA or a striping EA specified by lfs setstripe. For a file, the object group number is currently unused, but, in future releases, it will be used to identify groups of objects in a cluster metadata (CMD) environment.
    lmm_stripe_count (32 bits): Total stripe count of each normal file in the joined file.
    lmm_stripe_size (32 bits): Not used currently.
    lmmj_extent_count (32 bits): Number of normal files in the joined file.

MDS_EXTENT_DESCRIPTION

For each joined file, extent striping information is stored in a log file, which is referred to by llog_logid.

    struct llog_logid {
            u64 lgl_oid;
            u64 lgl_ogr;
            u32 lgl_ogen;
    };

JOINED_LOG_ID

Name (size): description
    lgl_oid (64 bits): Log ID of the object.
    lgl_ogr (64 bits): Log group of the object.
    lgl_ogen (32 bits): Log generation of the object.

JOINED File LOG Formats

The joined log file is composed of joined log records. Each joined record includes a log header, a joined_record and a log tail.

    struct mds_extent_desc {
            u64 med_start;
            u64 med_len;
            struct lov_mds_md med_lmm;
    };

    struct llog_rec_hdr {
            u32 lrh_len;
            u32 lrh_index;
            u32 lrh_type;
            u32 padding;
    };

    struct llog_rec_tail {
            u32 lrt_len;
            u32 lrt_index;
    };

    struct llog_array_rec {
            struct llog_rec_hdr lmr_hdr;
            struct mds_extent_desc lmr_med;
            struct llog_rec_tail lmr_tail;
    };
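To illustrate how the extent records might be used, the sketch below finds which normal file of a joined file contains a given offset, using only the med_start and med_len fields (the offset and length of each extent within the joined file). The `sketch_` names are hypothetical, the embedded striping EA is left out, and a linear scan over already-loaded records is an assumption rather than a description of Lustre's log handling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reduced form of mds_extent_desc: the embedded
 * lov_mds_md (med_lmm) is omitted for brevity. */
struct sketch_extent {
    uint64_t med_start;  /* offset of the extent in the joined file */
    uint64_t med_len;    /* length of the extent */
};

/* Return the index of the extent containing file_off, or -1 if the
 * offset lies outside every extent. On success, *off_in_extent (if
 * not NULL) receives the offset relative to the start of the extent. */
static long sketch_find_extent(const struct sketch_extent *ext, size_t count,
                               uint64_t file_off, uint64_t *off_in_extent)
{
    assert(ext != NULL || count == 0);

    for (size_t i = 0; i < count; i++) {
        if (file_off >= ext[i].med_start &&
            file_off - ext[i].med_start < ext[i].med_len) {
            if (off_in_extent != NULL)
                *off_in_extent = file_off - ext[i].med_start;
            return (long)i;
        }
    }
    return -1;
}
```

Once the extent is known, the relative offset can be mapped onto OST objects with the striping EA stored for that normal file.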

Name (size): description
    log_header
        lrh_len (32 bits): Log record length.
        lrh_index (32 bits): Log record index.
        lrh_type (32 bits): Log record type.
        padding (32 bits): Record padding for 4-byte alignment.
    joined record
        med_start (64 bits): Offset of the extent for the normal file in the joined file.
        med_len (64 bits): Length of the extent for the normal file in the joined file.
        med_lmm (size of LOV_MDS_MD): Striping information for each normal file (same as LOV_MDS_MD).
    log_tail
        lrt_len (32 bits): Log record length. The value is the same as for lrh_len.
        lrt_index (32 bits): Log record index. The value is the same as for lrh_index.

Striping Memory Format

The in-memory striping EA also includes general striping information and private information for each OST.

    struct lov_oinfo {
            u64 loi_id;
            u64 loi_gr;
            int loi_ost_idx;
            int loi_ost_gen;

            /* used by the osc to keep track of what objects to build into rpcs */
            struct loi_oap_pages loi_read_lop;
            struct loi_oap_pages loi_write_lop;

            /* _cli_ is poorly named, it should be _ready_ */
            struct list_head loi_cli_item;
            struct list_head loi_write_item;
            struct list_head loi_read_item;

            unsigned loi_kms_valid:1;
            u64 loi_kms;
            struct ost_lvb loi_lvb;
            struct osc_async_rc loi_ar;
    };

    struct lov_stripe_md {
            /* General striping information */
            spinlock_t lsm_lock;
            void *lsm_lock_owner;
            struct {
                    u64 lw_object_id;
                    u64 lw_object_gr;
                    u64 lw_maxbytes;
                    u32 lw_magic;
                    u32 lw_stripe_size;
                    u32 lw_pattern;
                    unsigned lw_stripe_count;
            } lsm_wire;
            /* Private OST array */
            struct lov_array_info *lsm_array;
            struct lov_oinfo *lsm_oinfo[0];
    };

Name (size): description
    lsm_lock (size of spinlock_t): lsm lock to protect each item of the striping EA.
    lsm_lock_owner (size of void*): Owner of the lsm_lock, for debugging purposes.
    lsm striping information
        lw_object_id (64 bits): lov object ID (same as lmm_object_id).
        lw_object_gr (64 bits): lov object group number (same as lmm_object_gr).
        lw_maxbytes (64 bits): Maximum possible file size.
        lw_magic (32 bits): lsm magic number (same as lmm_magic).
        lw_stripe_size (32 bits): Size of the stripe (same as lmm_stripe_size).
        lw_pattern (32 bits): Pattern of the stripe (same as lmm_pattern).
    OST array information
        lsm_array (size of pointer): Pointer to an lsm array, only for a joined file.
        loi_id (64 bits): Data object ID (same as l_object_id).
        loi_gr (64 bits): Data object group (same as l_object_gr).
        loi_ost_idx (size of int): OST index of the data object.
        loi_ost_gen (size of int): OST generation of the data object.
        loi_read_lop (size of struct loi_oap_pages): List of pending read pages for the file for this object server client (OSC).
        loi_write_lop (size of struct loi_oap_pages): List of pending write pages for the file for this OSC.

        loi_cli_item (size of struct list_head): List of objects ready to read/write for this OSC.
        loi_read_item (size of struct list_head): List of objects to be read for this OSC.
        loi_write_item (size of struct list_head): List of objects to be written for this OSC.
        loi_kms (64 bits): Known minimum size of the data object.
        loi_kms_valid (1-bit field): Valid flag for known minimum size.
        loi_lvb (size of struct ost_lvb): Lock value block. Used to capture data object status information (size, time, etc.) communicated between the filter and OSC. The Lustre client system (llite) and LOV (llite/lov) merge the acquired information into a complete set of information about the file.
        loi_ar (size of struct osc_async_rc): Used to propagate asynchronous writeback errors back up to the application. If an asynchronous write fails, an error code is recorded and used later when an application executes an fsync operation.

Striping User Format

The striping user format is used when the striping EA is retrieved by a user-level application (for example, with lfs getstripe/setstripe).

    struct lov_user_ost_data_v1 {
            u64 l_object_id;
            u64 l_object_gr;
            u32 l_ost_gen;
            u32 l_ost_idx;
    };

    struct lov_user_md {
            u32 lmm_magic;
            u32 lmm_pattern;
            u64 lmm_object_id;
            u64 lmm_object_gr;
            u32 lmm_stripe_size;
            u16 lmm_stripe_count;
            u16 lmm_stripe_offset;
            struct lov_user_ost_data_v1 lmm_objects[0];
    };

The user format differs from the in-disk format in the following ways:

- The user format has an lmm_stripe_offset field, which the in-disk format does not have. lmm_stripe_offset is used by setstripe to transfer the striping_index parameter to Lustre when setting a stripe.
- For the user format, lmm_stripe_count has only 16 bits, while for the in-disk format, lmm_stripe_count has 32 bits. So in the current Lustre release, the maximum stripe count is

Striping APIs

Lustre provides a set of APIs to handle the striping EAs. The five types of APIs are listed below according to their functionality:

- Set/get APIs. Used to set or get a striping EA to or from storage.
- Pack/unpack APIs. Because striping EAs are stored in packed format on disk, pack/unpack APIs are provided to pack and unpack striping EAs after a get or set striping EA API is used.
- Allocate/free APIs. Used to allocate and free striping EAs in memory.
- Striping location APIs. Since location information for data objects is stored in striping EAs, APIs are provided to access the striping EAs and return data object location information. These APIs are also used to select the OST where the data object is to be created.
- lfs APIs. User-level APIs used by applications (lfs utilities) to handle striping EAs.

The set/get APIs operate on striping EAs in in-disk format. The pack/unpack APIs operate on striping EAs in both in-disk and in-memory formats. The other APIs operate on striping EAs in in-memory format.

Get/Set Striping EA APIs

fsfilt_set/get_md

    int fsfilt_set_md(struct obd_device *obd, struct inode *inode, void *handle,
                      void *md, int size, const char *name)
    int fsfilt_get_md(struct obd_device *obd, struct inode *inode, void *md,
                      int size, const char *name)

Parameters:
    obd: Device of the object
    inode: MDS object
    handle: Journal handle for setting a striping EA
    md: Buffer of the striping EA
    size: Size of the striping EA
    name: Name (LOV) of the striping EA

Return values:
    fsfilt_set_md: 0 means success. A negative error number means an error.
    fsfilt_get_md: 0 means success. A positive return value is the number of bytes that need to be added to the buffer to make it large enough to contain the striping EA. A negative error number means an error.

Note: If the striping EA does not exist, get_md still returns 0.

These two APIs are used by the MDS to get or set a striping EA.

Pack/Unpack Striping EA APIs

obd_packmd

    int obd_packmd(struct obd_export *exp, struct lov_mds_md **disk_tgt,
                   struct lov_stripe_md *mem_src)

Parameters:
    exp: Export of the device
    disk_tgt: Disk structure for the striping EA
    mem_src: In-memory structure for the striping EA

Return values:
    If disk_tgt is NULL, the striping size of the in-memory structure *mem_src is returned.
    If both disk_tgt and mem_src are NULL, the maximum possible stripe size is returned.
    If *disk_tgt is not NULL and mem_src is NULL, *disk_tgt is freed.
    If *disk_tgt is NULL, an in-disk structure is allocated.

This API packs the striping EA from an in-memory format to an in-disk description.

obd_unpackmd

    int obd_unpackmd(struct obd_export *exp, struct lov_stripe_md **mem_tgt,
                     struct lov_mds_md *disk_src, int disk_len)

Parameters:
    exp: Export of the device
    mem_tgt: In-memory structure for the striping EA
    disk_src: Disk structure for the striping EA
    disk_len: Length of disk_src

Return values: A positive value indicates the size of the unpacked striping EA. 0 is returned when the API tries to free the disk_src. A negative value indicates an error.

This API unpacks the striping EA from an in-disk format (disk_src) to an in-memory description (mem_tgt). When mem_tgt is NULL, the API will free disk_src.

Allocation/Free

obd_size_diskmd

    int obd_size_diskmd(struct obd_export *exp, struct lov_stripe_md *mem_src)

Parameters:
    exp: Export of the device.
    mem_src: In-memory structure for the striping EA.

Return values: If mem_src is not NULL, the striping size pointed to by mem_src is returned. If mem_src is NULL, the maximum striping size is returned.

This API returns the real size of the striping EA.

obd_alloc_diskmd

    int obd_alloc_diskmd(struct obd_export *exp, struct lov_mds_md **disk_tgt)

Parameters:
    exp: Export of the device
    disk_tgt: Allocated in-disk-formatted striping EA

Return values: 0 means success. A negative number means an error.

This API returns the in-disk-formatted striping EA pointed to by disk_tgt. It allocates the maximum striping EA size, which typically equals the maximum data object count of the file * size of struct lov_ost_data.

obd_free_diskmd

    int obd_free_diskmd(struct obd_export *exp, struct lov_mds_md **disk_tgt)

Parameters:
    exp: Export of the device
    disk_tgt: In-disk-formatted striping EA memory to be freed

Return values: 0 means success. A negative number means an error.

This API frees the in-disk-formatted striping EA referenced by *disk_tgt.

obd_alloc_memmd

    int obd_alloc_memmd(struct obd_export *exp, struct lov_stripe_md **mem_tgt)

Parameters:
    exp: Export of the device
    mem_tgt: Allocated in-memory-formatted striping EA

Return values: 0 means success. A negative number means an error.

This API returns the in-memory-formatted striping EA pointed to by mem_tgt. It allocates the maximum striping EA size.

obd_free_memmd

    int obd_free_memmd(struct obd_export *exp, struct lov_stripe_md **mem_tgt)

Parameters:
    exp: Export of the device
    mem_tgt: In-memory-formatted striping EA memory to be freed

Return values: 0 means success. A negative number means an error.

This API frees the in-memory-formatted striping EA referenced by *mem_tgt.

Striping Location APIs

lov_stripe_size

    obd_size lov_stripe_size(struct lov_stripe_md *lsm, obd_size ost_size,
                             int stripeno)

Parameters:
    lsm: In-memory striping EA
    ost_size: Size of a single data object in an OST
    stripeno: Stripe number of the data object

Return values: 0 means success. A negative number means an error.

This API computes the file size given stripeno and the OST size, where stripeno and the OST size are associated with the OST where the end of the file is located.
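The file-size computation described for lov_stripe_size can be sketched as follows for a RAID-0 layout. This is an illustration under stated assumptions, not the Lustre implementation: the function name is hypothetical, the stripe size and stripe count are passed directly instead of through struct lov_stripe_md, and the computed size is assumed to be the return value, as the obd_size return type suggests.

```c
#include <assert.h>
#include <stdint.h>

/* Given the size of the data object that holds the end of the file and
 * that object's stripe number, compute the file size for round-robin
 * (RAID-0) striping. Hypothetical helper, not a Lustre API. */
static uint64_t sketch_file_size(uint32_t stripe_size, uint32_t stripe_count,
                                 uint64_t ost_size, uint32_t stripeno)
{
    assert(stripe_size > 0 && stripe_count > 0 && stripeno < stripe_count);

    if (ost_size == 0)
        return 0;

    uint64_t stripe_width = (uint64_t)stripe_size * stripe_count;
    uint64_t full = ost_size / stripe_size;  /* complete stripes in the object */
    uint64_t rest = ost_size % stripe_size;  /* bytes in the last, partial stripe */

    if (rest != 0)
        return full * stripe_width + (uint64_t)stripeno * stripe_size + rest;

    /* The object ends exactly on a stripe boundary. */
    return (full - 1) * stripe_width + ((uint64_t)stripeno + 1) * stripe_size;
}
```

For example, with a 1 MB stripe size and four stripes, an object of 1 MB + 100 bytes on stripe 2 corresponds to a file of 6 MB + 100 bytes: one full round of four stripes, two more full stripes, and 100 bytes on stripe 2.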

lov_stripe_offset

    int lov_stripe_offset(struct lov_stripe_md *lsm, obd_off lov_off,
                          int stripeno, obd_off *obd_off)

Parameters:
    lsm: In-memory striping EA
    lov_off: Logical file offset
    stripeno: Stripe number of the data object
    obd_off: Offset of the OST indicated by stripeno, which is nearest to the logical file offset (lov_off)

Return values:
    0 means the OST indicated by stripeno is exactly the same OST as the offset (lov_off) indicated.
    -1 means the index of the OST indicated by stripeno is less than the index of the OST indicated by the offset (lov_off).
    1 means the index of the OST indicated by stripeno is larger than the index of the OST indicated by the offset (lov_off).

This API is used to check whether an extent intersects with an OST.

lov_stripe_number

    int lov_stripe_number(struct lov_stripe_md *lsm, obd_off lov_off)

Parameters:
    lsm: In-memory striping EA
    lov_off: Logical file offset

Return values: 0 means success. A negative number means an error.

This API computes which stripe number lov_off belongs to.

lfs APIs

llapi_file_get_stripe

    int llapi_file_get_stripe(const char *path, struct lov_user_md *lum)

Parameters:
    path: Path of the file
    lum: Striping information returned to the caller

Return values: 0 means success. A negative number means an error.

This API returns striping information to the caller to be used by the application.

llapi_file_open

    int llapi_file_open(const char *name, int flags, int mode,
                        unsigned long stripe_size, int stripe_offset,
                        int stripe_count, int stripe_pattern)

Parameters:
    name: Filename
    flags: Open flags
    mode: Open mode
    stripe_size: Stripe size of the file
    stripe_offset: Stripe offset (stripe_index) of the file
    stripe_count: Stripe count of the file
    stripe_pattern: Stripe pattern of the file

Return values: 0 means success. A negative number means an error.

This API opens/creates a file with specified striping parameters.

Future Developments

With the currently implemented striping disk format, ->obd_unpackmd() must have an end-to-end understanding of all possible combinations of layouts, i.e., the format is basically flat rather than hierarchical. To facilitate development of new layouts, the striping disk format will be adjusted so that higher layers (e.g., struct lov_mds_md) can be parsed without knowing the details of the lower layer (in this case, struct lov_ost_data) representation. A straightforward way to do this is to precede each layout descriptor with the standard header:

    struct md_layout_descriptor_header {
            u16 mldh_magic;
            u16 mldh_length;
    };

where ->mldh_magic identifies the layout type and is used to determine the ->obd_unpackmd() method to be called to parse the descriptor; and ->mldh_length is the total descriptor length, which is used by the upper layer to pass over lower layer descriptors without understanding details of their representation. Care must be taken, however, to avoid introducing too much redundant information to the on-disk EA for the most common uses.
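A minimal sketch of how an upper layer could step over descriptors using only this header is shown below. It is an assumption-laden illustration of the proposal, not an existing Lustre interface: the function name and the example magic values are hypothetical, mldh_length is assumed to include the header itself, descriptors are assumed to be laid out back to back, and the fields are read in host byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Header proposed in the text, with fixed-width types for u16. */
struct md_layout_descriptor_header {
    uint16_t mldh_magic;   /* identifies the layout type */
    uint16_t mldh_length;  /* total descriptor length (assumed to include this header) */
};

/* Walk a buffer of back-to-back descriptors and count those whose magic
 * equals `wanted`, skipping all others by length alone.
 * Returns -1 if a descriptor is truncated or has an impossible length. */
static int sketch_count_descriptors(const unsigned char *buf, size_t len,
                                    uint16_t wanted)
{
    assert(buf != NULL || len == 0);

    size_t pos = 0;
    int found = 0;

    while (pos < len) {
        struct md_layout_descriptor_header hdr;

        if (len - pos < sizeof(hdr))
            return -1;
        memcpy(&hdr, buf + pos, sizeof(hdr));  /* avoids unaligned access */

        if (hdr.mldh_length < sizeof(hdr) || hdr.mldh_length > len - pos)
            return -1;
        if (hdr.mldh_magic == wanted)
            found++;

        pos += hdr.mldh_length;  /* pass over the body without parsing it */
    }
    return found;
}
```

The loop never inspects a descriptor body, which is the property the proposal is after: a layer only needs to understand the magic values it owns.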

Glossary

ADIO: Abstract-device interface for I/O. The ADIO driver is an abstract-device interface for parallel I/O that is used by the MPI to implement its I/O library.
CMD: Cluster metadata
EA: Extended attribute
llite: Lustre client system
LOV: Logical object volume
MDS: Metadata server
MGS: Management server
MPI: Message Passing Interface
OSC: Object server client
OST: Object storage server


More information

Example Implementations of File Systems

Example Implementations of File Systems Example Implementations of File Systems Last modified: 22.05.2017 1 Linux file systems ext2, ext3, ext4, proc, swap LVM Contents ZFS/OpenZFS NTFS - the main MS Windows file system 2 Linux File Systems

More information

Data Management. Parallel Filesystems. Dr David Henty HPC Training and Support

Data Management. Parallel Filesystems. Dr David Henty HPC Training and Support Data Management Dr David Henty HPC Training and Support d.henty@epcc.ed.ac.uk +44 131 650 5960 Overview Lecture will cover Why is IO difficult Why is parallel IO even worse Lustre GPFS Performance on ARCHER

More information

Nathan Rutman SC09 Portland, OR. Lustre HSM

Nathan Rutman SC09 Portland, OR. Lustre HSM Nathan Rutman SC09 Portland, OR Lustre HSM Goals Scalable HSM system > No scanning > No duplication of event data > Parallel data transfer Interact easily with many HSMs Focus: > Version 1 primary goal

More information

Lustre overview and roadmap to Exascale computing

Lustre overview and roadmap to Exascale computing HPC Advisory Council China Workshop Jinan China, October 26th 2011 Lustre overview and roadmap to Exascale computing Liang Zhen Whamcloud, Inc liang@whamcloud.com Agenda Lustre technology overview Lustre

More information

Lustre Parallel Filesystem Best Practices

Lustre Parallel Filesystem Best Practices Lustre Parallel Filesystem Best Practices George Markomanolis Computational Scientist KAUST Supercomputing Laboratory georgios.markomanolis@kaust.edu.sa 7 November 2017 Outline Introduction to Parallel

More information

Inode. Local filesystems. The operations defined for local filesystems are divided in two parts:

Inode. Local filesystems. The operations defined for local filesystems are divided in two parts: Local filesystems Inode The operations defined for local filesystems are divided in two parts: 1. Common to all local filesystems are hierarchical naming, locking, quotas attribute management and protection.

More information

<Insert Picture Here> Btrfs Filesystem

<Insert Picture Here> Btrfs Filesystem Btrfs Filesystem Chris Mason Btrfs Goals General purpose filesystem that scales to very large storage Feature focused, providing features other Linux filesystems cannot Administration

More information

Parallel I/O. Steve Lantz Senior Research Associate Cornell CAC. Workshop: Parallel Computing on Ranger and Lonestar, May 16, 2012

Parallel I/O. Steve Lantz Senior Research Associate Cornell CAC. Workshop: Parallel Computing on Ranger and Lonestar, May 16, 2012 Parallel I/O Steve Lantz Senior Research Associate Cornell CAC Workshop: Parallel Computing on Ranger and Lonestar, May 16, 2012 Based on materials developed by Bill Barth at TACC Introduction: The Parallel

More information

we are here Page 1 Recall: How do we Hide I/O Latency? I/O & Storage Layers Recall: C Low level I/O

we are here Page 1 Recall: How do we Hide I/O Latency? I/O & Storage Layers Recall: C Low level I/O CS162 Operating Systems and Systems Programming Lecture 18 Systems October 30 th, 2017 Prof. Anthony D. Joseph http://cs162.eecs.berkeley.edu Recall: How do we Hide I/O Latency? Blocking Interface: Wait

More information

DNE2 High Level Design

DNE2 High Level Design DNE2 High Level Design Introduction With the release of DNE Phase I Remote Directories Lustre* file systems now supports more than one MDT. This feature has some limitations: Only an administrator can

More information

Using file systems at HC3

Using file systems at HC3 Using file systems at HC3 Roland Laifer STEINBUCH CENTRE FOR COMPUTING - SCC KIT University of the State of Baden-Württemberg and National Laboratory of the Helmholtz Association www.kit.edu Basic Lustre

More information

INTERNAL REPRESENTATION OF FILES:

INTERNAL REPRESENTATION OF FILES: INTERNAL REPRESENTATION OF FILES: Every file on a UNIX system has a unique inode. The inode contains the information necessary for a process to access a file, such as file ownership, access rights, file

More information

UNIX File System. UNIX File System. The UNIX file system has a hierarchical tree structure with the top in root.

UNIX File System. UNIX File System. The UNIX file system has a hierarchical tree structure with the top in root. UNIX File System UNIX File System The UNIX file system has a hierarchical tree structure with the top in root. Files are located with the aid of directories. Directories can contain both file and directory

More information

BTREE FILE SYSTEM (BTRFS)

BTREE FILE SYSTEM (BTRFS) BTREE FILE SYSTEM (BTRFS) What is a file system? It can be defined in different ways A method of organizing blocks on a storage device into files and directories. A data structure that translates the physical

More information

The EXT2FS Library. The EXT2FS Library Version 1.37 January by Theodore Ts o

The EXT2FS Library. The EXT2FS Library Version 1.37 January by Theodore Ts o The EXT2FS Library The EXT2FS Library Version 1.37 January 2005 by Theodore Ts o Copyright c 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005 Theodore Ts o Permission is granted to make and distribute

More information

File Management 1/34

File Management 1/34 1/34 Learning Objectives system organization and recursive traversal buffering and memory mapping for performance Low-level data structures for implementing filesystems Disk space management for sample

More information

Parallel I/O Techniques and Performance Optimization

Parallel I/O Techniques and Performance Optimization Parallel I/O Techniques and Performance Optimization Lonnie Crosby lcrosby1@utk.edu NICS Scientific Computing Group NICS/RDAV Spring Training 2012 March 22, 2012 2 Outline Introduction to I/O Path from

More information

Small File I/O Performance in Lustre. Mikhail Pershin, Joe Gmitter Intel HPDD April 2018

Small File I/O Performance in Lustre. Mikhail Pershin, Joe Gmitter Intel HPDD April 2018 Small File I/O Performance in Lustre Mikhail Pershin, Joe Gmitter Intel HPDD April 2018 Overview Small File I/O Concerns Data on MDT (DoM) Feature Overview DoM Use Cases DoM Performance Results Small File

More information

bytes per disk block (a block is usually called sector in the disk drive literature), sectors in each track, read/write heads, and cylinders (tracks).

bytes per disk block (a block is usually called sector in the disk drive literature), sectors in each track, read/write heads, and cylinders (tracks). Understanding FAT 12 You need to address many details to solve this problem. The exercise is broken down into parts to reduce the overall complexity of the problem: Part A: Construct the command to list

More information

The Journalling Flash File System

The Journalling Flash File System The Journalling Flash File System http://sources.redhat.com/jffs2/ David Woodhouse dwmw2@cambridge.redhat.com 1 The Grand Plan What is Flash? How is it used? Flash Translation Layer (FTL) NFTL Better ways

More information

What is a file system

What is a file system COSC 6397 Big Data Analytics Distributed File Systems Edgar Gabriel Spring 2017 What is a file system A clearly defined method that the OS uses to store, catalog and retrieve files Manage the bits that

More information

High Level Design IOD KV Store FOR EXTREME-SCALE COMPUTING RESEARCH AND DEVELOPMENT (FAST FORWARD) STORAGE AND I/O

High Level Design IOD KV Store FOR EXTREME-SCALE COMPUTING RESEARCH AND DEVELOPMENT (FAST FORWARD) STORAGE AND I/O Date: January 10, 2013 High Level Design IOD KV Store FOR EXTREME-SCALE COMPUTING RESEARCH AND DEVELOPMENT (FAST FORWARD) STORAGE AND I/O LLNS Subcontract No. Subcontractor Name Subcontractor Address B599860

More information

Application I/O on Blue Waters. Rob Sisneros Kalyana Chadalavada

Application I/O on Blue Waters. Rob Sisneros Kalyana Chadalavada Application I/O on Blue Waters Rob Sisneros Kalyana Chadalavada I/O For Science! HDF5 I/O Library PnetCDF Adios IOBUF Scien'st Applica'on I/O Middleware U'li'es Parallel File System Darshan Blue Waters

More information

Filesystem. Disclaimer: some slides are adopted from book authors slides with permission 1

Filesystem. Disclaimer: some slides are adopted from book authors slides with permission 1 Filesystem Disclaimer: some slides are adopted from book authors slides with permission 1 Recap Blocking, non-blocking, asynchronous I/O Data transfer methods Programmed I/O: CPU is doing the IO Pros Cons

More information

A Comparative Experimental Study of Parallel File Systems for Large-Scale Data Processing

A Comparative Experimental Study of Parallel File Systems for Large-Scale Data Processing A Comparative Experimental Study of Parallel File Systems for Large-Scale Data Processing Z. Sebepou, K. Magoutis, M. Marazakis, A. Bilas Institute of Computer Science (ICS) Foundation for Research and

More information

we are here I/O & Storage Layers Recall: C Low level I/O Recall: C Low Level Operations CS162 Operating Systems and Systems Programming Lecture 18

we are here I/O & Storage Layers Recall: C Low level I/O Recall: C Low Level Operations CS162 Operating Systems and Systems Programming Lecture 18 I/O & Storage Layers CS162 Operating Systems and Systems Programming Lecture 18 Systems April 2 nd, 2018 Profs. Anthony D. Joseph & Jonathan Ragan-Kelley http://cs162.eecs.berkeley.edu Application / Service

More information

The Journalling Flash File System

The Journalling Flash File System The Journalling Flash File System http://sources.redhat.com/jffs2/ David Woodhouse dwmw2@cambridge.redhat.com 1 The Grand Plan What is Flash? How is it used? Flash Translation Layer (FTL) NFTL Better ways

More information

Chapter 11: Implementing File-Systems

Chapter 11: Implementing File-Systems Chapter 11: Implementing File-Systems Chapter 11 File-System Implementation 11.1 File-System Structure 11.2 File-System Implementation 11.3 Directory Implementation 11.4 Allocation Methods 11.5 Free-Space

More information

File Systems. Chapter 11, 13 OSPP

File Systems. Chapter 11, 13 OSPP File Systems Chapter 11, 13 OSPP What is a File? What is a Directory? Goals of File System Performance Controlled Sharing Convenience: naming Reliability File System Workload File sizes Are most files

More information

Chapter 11: File System Implementation

Chapter 11: File System Implementation Chapter 11: File System Implementation Chapter 11: File System Implementation File-System Structure File-System Implementation Directory Implementation Allocation Methods Free-Space Management Efficiency

More information

Welcome! Virtual tutorial starts at 15:00 BST

Welcome! Virtual tutorial starts at 15:00 BST Welcome! Virtual tutorial starts at 15:00 BST Parallel IO and the ARCHER Filesystem ARCHER Virtual Tutorial, Wed 8 th Oct 2014 David Henty Reusing this material This work is licensed

More information

The UNIX File System

The UNIX File System The UNIX File System Magnus Johansson (May 2007) 1 UNIX file system A file system is created with mkfs. It defines a number of parameters for the system as depicted in figure 1. These paremeters include

More information

Chapter 11: File System Implementation

Chapter 11: File System Implementation Chapter 11: File System Implementation Chapter 11: File System Implementation File-System Structure File-System Implementation Directory Implementation Allocation Methods Free-Space Management Efficiency

More information

Da-Wei Chang CSIE.NCKU. Professor Hao-Ren Ke, National Chiao Tung University Professor Hsung-Pin Chang, National Chung Hsing University

Da-Wei Chang CSIE.NCKU. Professor Hao-Ren Ke, National Chiao Tung University Professor Hsung-Pin Chang, National Chung Hsing University Chapter 11 Implementing File System Da-Wei Chang CSIE.NCKU Source: Professor Hao-Ren Ke, National Chiao Tung University Professor Hsung-Pin Chang, National Chung Hsing University Outline File-System Structure

More information

Chapter 11: Implementing File Systems

Chapter 11: Implementing File Systems Chapter 11: Implementing File-Systems, Silberschatz, Galvin and Gagne 2009 Chapter 11: Implementing File Systems File-System Structure File-System Implementation ti Directory Implementation Allocation

More information

CS 140 Project 4 File Systems Review Session

CS 140 Project 4 File Systems Review Session CS 140 Project 4 File Systems Review Session Prachetaa Due Friday March, 14 Administrivia Course withdrawal deadline today (Feb 28 th ) 5 pm Project 3 due today (Feb 28 th ) Review section for Finals on

More information

Chapter 12: File System Implementation

Chapter 12: File System Implementation Chapter 12: File System Implementation Chapter 12: File System Implementation File-System Structure File-System Implementation Directory Implementation Allocation Methods Free-Space Management Efficiency

More information

ETFS Design and Implementation Notes#

ETFS Design and Implementation Notes# ETFS Design and Implementation Notes# One of the first questions that comes to mind when learning a new file system, is "What is the on disk file system structure?" ETFS does not have one. Well, not a

More information

Chapter 10: File System Implementation

Chapter 10: File System Implementation Chapter 10: File System Implementation Chapter 10: File System Implementation File-System Structure" File-System Implementation " Directory Implementation" Allocation Methods" Free-Space Management " Efficiency

More information

Input & Output 1: File systems

Input & Output 1: File systems Input & Output 1: File systems What are files? A sequence of (usually) fixed sized blocks stored on a device. A device is often refered to as a volume. A large device might be split into several volumes,

More information

Lustre Clustered Meta-Data (CMD) Huang Hua Andreas Dilger Lustre Group, Sun Microsystems

Lustre Clustered Meta-Data (CMD) Huang Hua Andreas Dilger Lustre Group, Sun Microsystems Lustre Clustered Meta-Data (CMD) Huang Hua H.Huang@Sun.Com Andreas Dilger adilger@sun.com Lustre Group, Sun Microsystems 1 Agenda What is CMD? How does it work? What are FIDs? CMD features CMD tricks Upcoming

More information

The EXT2FS Library. The EXT2FS Library Version 1.38 June by Theodore Ts o

The EXT2FS Library. The EXT2FS Library Version 1.38 June by Theodore Ts o The EXT2FS Library The EXT2FS Library Version 1.38 June 2005 by Theodore Ts o Copyright c 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005 Theodore Ts o Permission is granted to make and distribute

More information

NFS in Userspace: Goals and Challenges

NFS in Userspace: Goals and Challenges NFS in Userspace: Goals and Challenges Tai Horgan EMC Isilon Storage Division 2013 Storage Developer Conference. Insert Your Company Name. All Rights Reserved. Introduction: OneFS Clustered NAS File Server

More information

Recent developments in GFS2. Steven Whitehouse Manager, GFS2 Filesystem LinuxCon Europe October 2013

Recent developments in GFS2. Steven Whitehouse Manager, GFS2 Filesystem LinuxCon Europe October 2013 Recent developments in GFS2 Steven Whitehouse Manager, GFS2 Filesystem LinuxCon Europe October 2013 Topics Principles of operation Locking Hints and Tips Inodes, Directories and System files NFS/Samba

More information

CS 537 Fall 2017 Review Session

CS 537 Fall 2017 Review Session CS 537 Fall 2017 Review Session Deadlock Conditions for deadlock: Hold and wait No preemption Circular wait Mutual exclusion QUESTION: Fix code List_insert(struct list * head, struc node * node List_move(struct

More information

File System Case Studies. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

File System Case Studies. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University File System Case Studies Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Today s Topics The Original UNIX File System FFS Ext2 FAT 2 UNIX FS (1)

More information

An Exploration of New Hardware Features for Lustre. Nathan Rutman

An Exploration of New Hardware Features for Lustre. Nathan Rutman An Exploration of New Hardware Features for Lustre Nathan Rutman Motivation Open-source Hardware-agnostic Linux Least-common-denominator hardware 2 Contents Hardware CRC MDRAID T10 DIF End-to-end data

More information

CS370 Operating Systems

CS370 Operating Systems CS370 Operating Systems Colorado State University Yashwant K Malaiya Fall 2017 Lecture 24 File Systems Slides based on Text by Silberschatz, Galvin, Gagne Various sources 1 1 Questions from last time How

More information

COSC 6374 Parallel Computation. Parallel I/O (I) I/O basics. Concept of a clusters

COSC 6374 Parallel Computation. Parallel I/O (I) I/O basics. Concept of a clusters COSC 6374 Parallel I/O (I) I/O basics Fall 2010 Concept of a clusters Processor 1 local disks Compute node message passing network administrative network Memory Processor 2 Network card 1 Network card

More information

Lustre Data on MDT an early look

Lustre Data on MDT an early look Lustre Data on MDT an early look LUG 2018 - April 2018 Argonne National Laboratory Frank Leers DDN Performance Engineering DDN Storage 2018 DDN Storage Agenda DoM Overview Practical Usage Performance Investigation

More information

OPERATING SYSTEM. Chapter 12: File System Implementation

OPERATING SYSTEM. Chapter 12: File System Implementation OPERATING SYSTEM Chapter 12: File System Implementation Chapter 12: File System Implementation File-System Structure File-System Implementation Directory Implementation Allocation Methods Free-Space Management

More information

CSE 509: Computer Security

CSE 509: Computer Security CSE 509: Computer Security Date: 2.16.2009 BUFFER OVERFLOWS: input data Server running a daemon Attacker Code The attacker sends data to the daemon process running at the server side and could thus trigger

More information

DLD for OPEN HANDLING in CMD

DLD for OPEN HANDLING in CMD DLD for OPEN HANDLING in CMD Huang Hua Jul 5, 2006 Contents 1 Introduction 2 2 Functional Specification 2 2.1 Abstract.............................. 2 2.2 Data structures......................... 2 2.2.1

More information

Coordinating Parallel HSM in Object-based Cluster Filesystems

Coordinating Parallel HSM in Object-based Cluster Filesystems Coordinating Parallel HSM in Object-based Cluster Filesystems Dingshan He, Xianbo Zhang, David Du University of Minnesota Gary Grider Los Alamos National Lab Agenda Motivations Parallel archiving/retrieving

More information

CSE 333 SECTION 3. POSIX I/O Functions

CSE 333 SECTION 3. POSIX I/O Functions CSE 333 SECTION 3 POSIX I/O Functions Administrivia Questions (?) HW1 Due Tonight Exercise 7 due Monday (out later today) POSIX Portable Operating System Interface Family of standards specified by the

More information

Chapter 11: Implementing File

Chapter 11: Implementing File Chapter 11: Implementing File Systems Chapter 11: Implementing File Systems File-System Structure File-System Implementation Directory Implementation Allocation Methods Free-Space Management Efficiency

More information

The UNIX File System

The UNIX File System The UNIX File System Magnus Johansson May 9, 2007 1 UNIX file system A file system is created with mkfs. It defines a number of parameters for the system, such as: bootblock - contains a primary boot program

More information

Logical disks. Bach 2.2.1

Logical disks. Bach 2.2.1 Logical disks Bach 2.2.1 Physical disk is divided into partitions or logical disks Logical disk linear sequence of fixed size, randomly accessible, blocks disk device driver maps underlying physical storage

More information

Chapter 11: File System Implementation. Objectives

Chapter 11: File System Implementation. Objectives Chapter 11: File System Implementation Objectives To describe the details of implementing local file systems and directory structures To describe the implementation of remote file systems To discuss block

More information

Chapter 11: Implementing File Systems

Chapter 11: Implementing File Systems Chapter 11: Implementing File Systems Operating System Concepts 99h Edition DM510-14 Chapter 11: Implementing File Systems File-System Structure File-System Implementation Directory Implementation Allocation

More information

Fall 2017 :: CSE 306. File Systems Basics. Nima Honarmand

Fall 2017 :: CSE 306. File Systems Basics. Nima Honarmand File Systems Basics Nima Honarmand File and inode File: user-level abstraction of storage (and other) devices Sequence of bytes inode: internal OS data structure representing a file inode stands for index

More information

File System Implementation

File System Implementation File System Implementation Last modified: 16.05.2017 1 File-System Structure Virtual File System and FUSE Directory Implementation Allocation Methods Free-Space Management Efficiency and Performance. Buffering

More information

Chapter 11: Implementing File Systems. Operating System Concepts 9 9h Edition

Chapter 11: Implementing File Systems. Operating System Concepts 9 9h Edition Chapter 11: Implementing File Systems Operating System Concepts 9 9h Edition Silberschatz, Galvin and Gagne 2013 Chapter 11: Implementing File Systems File-System Structure File-System Implementation Directory

More information

File Systems: Consistency Issues

File Systems: Consistency Issues File Systems: Consistency Issues File systems maintain many data structures Free list/bit vector Directories File headers and inode structures res Data blocks File Systems: Consistency Issues All data

More information

ECE 598 Advanced Operating Systems Lecture 19

ECE 598 Advanced Operating Systems Lecture 19 ECE 598 Advanced Operating Systems Lecture 19 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 7 April 2016 Homework #7 was due Announcements Homework #8 will be posted 1 Why use

More information

OCFS2 Mark Fasheh Oracle

OCFS2 Mark Fasheh Oracle OCFS2 Mark Fasheh Oracle What is OCFS2? General purpose cluster file system Shared disk model Symmetric architecture Almost POSIX compliant fcntl(2) locking Shared writeable mmap Cluster stack Small, suitable

More information

An Overview of The Global File System

An Overview of The Global File System An Overview of The Global File System Ken Preslan Sistina Software kpreslan@sistina.com David Teigland University of Minnesota teigland@borg.umn.edu Matthew O Keefe University of Minnesota okeefe@borg.umn.edu

More information

DAOS Lustre Restructuring and Protocol Changes Design FOR EXTREME-SCALE COMPUTING RESEARCH AND DEVELOPMENT (FAST FORWARD) STORAGE AND I/O

DAOS Lustre Restructuring and Protocol Changes Design FOR EXTREME-SCALE COMPUTING RESEARCH AND DEVELOPMENT (FAST FORWARD) STORAGE AND I/O Date: May 26th, 2014 DAOS Lustre Restructuring and Protocol Changes Design FOR EXTREME-SCALE COMPUTING RESEARCH AND DEVELOPMENT (FAST FORWARD) STORAGE AND I/O LLNS Subcontract No. Subcontractor Name Subcontractor

More information

HLD For SMP node affinity

HLD For SMP node affinity HLD For SMP node affinity Introduction Current versions of Lustre rely on a single active metadata server. Metadata throughput may be a bottleneck for large sites with many thousands of nodes. System architects

More information

GridNFS: Scaling to Petabyte Grid File Systems. Andy Adamson Center For Information Technology Integration University of Michigan

GridNFS: Scaling to Petabyte Grid File Systems. Andy Adamson Center For Information Technology Integration University of Michigan GridNFS: Scaling to Petabyte Grid File Systems Andy Adamson Center For Information Technology Integration University of Michigan What is GridNFS? GridNFS is a collection of NFS version 4 features and minor

More information

Chapter 12 File-System Implementation

Chapter 12 File-System Implementation Chapter 12 File-System Implementation 1 Outline File-System Structure File-System Implementation Directory Implementation Allocation Methods Free-Space Management Efficiency and Performance Recovery Log-Structured

More information

FILE SYSTEMS, PART 2. CS124 Operating Systems Fall , Lecture 24

FILE SYSTEMS, PART 2. CS124 Operating Systems Fall , Lecture 24 FILE SYSTEMS, PART 2 CS124 Operating Systems Fall 2017-2018, Lecture 24 2 Last Time: File Systems Introduced the concept of file systems Explored several ways of managing the contents of files Contiguous

More information

CS 470 Spring Distributed Web and File Systems. Mike Lam, Professor. Content taken from the following:

CS 470 Spring Distributed Web and File Systems. Mike Lam, Professor. Content taken from the following: CS 470 Spring 2018 Mike Lam, Professor Distributed Web and File Systems Content taken from the following: "Distributed Systems: Principles and Paradigms" by Andrew S. Tanenbaum and Maarten Van Steen (Chapters

More information

OPERATING SYSTEMS II DPL. ING. CIPRIAN PUNGILĂ, PHD.

OPERATING SYSTEMS II DPL. ING. CIPRIAN PUNGILĂ, PHD. OPERATING SYSTEMS II DPL. ING. CIPRIAN PUNGILĂ, PHD. File System Implementation FILES. DIRECTORIES (FOLDERS). FILE SYSTEM PROTECTION. B I B L I O G R A P H Y 1. S I L B E R S C H AT Z, G A L V I N, A N

More information

grib_api.h File Reference

grib_api.h File Reference grib_api.h File Reference Copyright 2005-2013 ECMWF. More... Defines #define GRIB_API_VERSION (GRIB_API_MAJOR_VERSION*10000+G RIB_API_MINOR_VERSION*100+GRIB_API_REVISION_VERSI ON) #define GRIB_SECTION_PRODUCT

More information

Chapter 12: File System Implementation. Operating System Concepts 9 th Edition

Chapter 12: File System Implementation. Operating System Concepts 9 th Edition Chapter 12: File System Implementation Silberschatz, Galvin and Gagne 2013 Chapter 12: File System Implementation File-System Structure File-System Implementation Directory Implementation Allocation Methods

More information

Chapter 12: File System Implementation

Chapter 12: File System Implementation Chapter 12: File System Implementation Silberschatz, Galvin and Gagne 2013 Chapter 12: File System Implementation File-System Structure File-System Implementation Directory Implementation Allocation Methods

More information

LustreFS and its ongoing Evolution for High Performance Computing and Data Analysis Solutions

LustreFS and its ongoing Evolution for High Performance Computing and Data Analysis Solutions LustreFS and its ongoing Evolution for High Performance Computing and Data Analysis Solutions Roger Goff Senior Product Manager DataDirect Networks, Inc. What is Lustre? Parallel/shared file system for

More information

Radix Tree, IDR APIs and their test suite. Rehas Sachdeva & Sandhya Bankar

Radix Tree, IDR APIs and their test suite. Rehas Sachdeva & Sandhya Bankar Radix Tree, IDR APIs and their test suite Rehas Sachdeva & Sandhya Bankar Introduction Outreachy intern Dec 2016-March 2017 for Linux kernel, mentored by Rik van Riel and Matthew Wilcox. 4th year undergrad

More information

[537] Journaling. Tyler Harter

[537] Journaling. Tyler Harter [537] Journaling Tyler Harter FFS Review Problem 1 What structs must be updated in addition to the data block itself? [worksheet] Problem 1 What structs must be updated in addition to the data block itself?

More information

Operating Systems Design Exam 2 Review: Spring 2011

Operating Systems Design Exam 2 Review: Spring 2011 Operating Systems Design Exam 2 Review: Spring 2011 Paul Krzyzanowski pxk@cs.rutgers.edu 1 Question 1 CPU utilization tends to be lower when: a. There are more processes in memory. b. There are fewer processes

More information