vSphere Flash Device Support
October 31, 2017


Table of Contents
1. vSphere Flash Device Support - Main
   1.1 Target Audience
   1.2 Resolution
   1.3 SSD and Flash Device Use Cases
   1.4 SSD Endurance Criteria
   1.5 SSD Selection Requirements
   1.6 ESXi Coredump Device Usage Model
   1.7 VMware Support Policy
2. vSphere Flash Device Support - Appendices
   2.1 Appendix 1: ESXi Logging Sequential Requirement
   2.2 Appendix 2: OEM Pre-Install Coredump Size
   2.3 Appendix 3: Resizing Coredump Partitions
   2.4 Appendix 4: Creating a Logging Partition

1. vSphere Flash Device Support - Main

This article provides guidance for SSD and flash device usage with vSphere, covering basic requirements as well as recommendations for specific use cases.

1.1 Target Audience

Customer: Ensure that vSphere hosts are populated with flash and SSD devices that meet the required size and endurance criteria set forth in Table 1 for the various use cases. Also, for devices used for the coredump use case on hosts with a large amount of system memory or when VSAN is in use, ensure that the device size matches the guidance in Table 2 and that the actual coredump partition size is adequate.

System Vendor: Ensure that servers certified for vSphere use supported flash and SSD devices that meet the required size and endurance criteria set forth in Table 1 for the various use cases. For systems with large memory, as well as for VSAN Ready Nodes, vendors should take care to size flash or SSD devices used for the coredump use case as specified in Table 2 to ensure adequate operation in the event of a system crash, and to ensure that the coredump partition is correctly sized, as default settings may need to be overridden. For USB factory pre-installs of ESXi on these systems, consult Appendix 2.

Flash Device Vendor: Ensure that your SSD and flash drive types meet the required endurance criteria set forth in Table 1 for the various use cases. For the logging partition use case, ensure that recommended low-cost flash devices have sufficient sequential endurance to meet the alternative requirement outlined in Appendix 1.

1.2 Resolution

VMware vSphere ESXi can use locally attached SSDs (Solid State Disks) and flash devices in multiple ways. Since SSDs offer much higher throughput and much lower latency than traditional magnetic hard disks, the benefits are clear. While offering lower throughput and higher latency than SSDs, flash devices such as USB or SATADOM can also be appropriate for some use cases. The potential drawback to using SSD and flash device storage is that their endurance can be significantly less than that of traditional magnetic disks, and it can vary based on the workload type as well as factors such as drive capacity, underlying flash technology, and so on. This white paper outlines the minimum SSD and flash device recommendations based on different technologies and use case scenarios.

1.3 SSD and Flash Device Use Cases

A non-exhaustive survey of the various usage models in a vSphere environment is listed below.

Host swap cache
This usage model has been supported since vSphere 5.1 for SATA and SCSI connected SSDs. USB and low-end SATA or SCSI flash devices are not supported. The workload is heavily influenced by the degree of host memory overcommitment.

Regular datastore
A (local) SSD is used instead of a hard disk drive. This usage model has been supported since vSphere 6.0 for SATA and SCSI connected SSDs. There is currently no support for USB connected SSDs or for low-end flash devices regardless of connection type.

vSphere Flash Read Cache (aka Virtual Flash)
This usage model has been supported since vSphere 5.5 for SATA and SCSI connected SSDs. There is no support for USB connected SSDs or for low-end flash devices.

VSAN
This usage model has been supported since vSphere 5.5 for SATA and SCSI SSDs.

The VSAN Hardware Quick Reference Guide should be consulted for detailed requirements.

vSphere ESXi Boot Disk
A USB flash drive, SATADOM, or local SSD can be chosen as the install target for ESXi, the vSphere hypervisor, which then boots from the flash device. This usage model has been supported since vSphere 3.5 for USB flash devices and vSphere 4.0 for SCSI/SATA connected devices. Installation to a SATA or SCSI connected SSD, SATADOM, or flash device creates a full install image which includes a logging partition (see below), whereas installation to a USB device creates a boot disk image without a logging partition.

vSphere ESXi Coredump Device
The default size for the large coredump partition is 2.5 GiB (about 2.7 GB), and the installer creates a large coredump partition on the boot device for vSphere 5.5 and above. After installation the partition can be resized if necessary; consult Appendix 3 for detailed remediation steps. Any SATADOM or SATA/SCSI SSD may be configured with a coredump partition. In a coming release of vSphere, non-boot USB flash devices may also be supported. This usage model has been supported since vSphere 3.5 for boot USB flash devices and since vSphere 4.0 for any SATA or SCSI connected SSD that is local. This usage model also applies to Auto Deploy hosts, which have no boot disk.

vSphere ESXi Logging Device
A SATADOM or local SATA/SCSI SSD is chosen as the location for the vSphere logging partition (aka the /scratch partition). This partition may be, but need not be, on the boot disk; this also applies to Auto Deploy hosts, which lack a boot disk. This usage model has been supported since vSphere 6.0 for any SATA or SCSI connected SSD that is local. SATADOMs that meet the requirement set forth in Table 1 are also supported. This usage model may be supported in a future release of vSphere for USB flash devices that meet the requirement set forth in Table 1.

1.4 SSD Endurance Criteria

The flash industry often uses Terabytes Written (TBW) as a benchmark for SSD endurance. TBW is the number of terabytes that can be written to the device over its useful life. Most devices have distinct TBW ratings for sequential and random IO workloads, with the latter being much lower due to WAF (defined below). Other commonly used measures of endurance are DWPD (Drive Writes Per Day) and P/E (Program/Erase) cycles. Conversion formulas are provided here for the reader's convenience:

Converting DWPD (Drive Writes Per Day) to TBW (Terabytes Written):
TBW = DWPD * Warranty (in Years) * 365 * Capacity (in GB) / 1,000 (GB per TB)

Converting Flash P/E Cycles per Cell to TBW (Terabytes Written):
TBW = Capacity (in GB) * (P/E Cycles per Cell) / (1,000 (GB per TB) * WAF)

WAF (Write Amplification Factor) is a measure of the induced writes caused by inherent properties of flash technology. Because of the difference between the storage block size (512 bytes), the flash cell size (typically 4 KiB or 8 KiB), and the minimum flash erase size of many cells, one write can force a number of induced writes due to copies, garbage collection, and so on. For sequential workloads typical WAFs fall in the range of single digits, while for random workloads WAFs can approach or even exceed 100.

Table 1 contains workload characterization for the various workloads except the Datastore and vSphere Flash Read Cache workloads, which depend on the characteristics of the virtual machine workloads being run and thus cannot be characterized here. A WAF from the table can be used with the above P/E -> TBW formula.
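For illustration only, consider a hypothetical 480 GB drive with a 5-year warranty rated at 1 DWPD, and assume 3,000 P/E cycles per cell with a WAF of 3 (all figures are illustrative, not vendor data):

TBW = 1 * 5 * 365 * 480 / 1,000 = 876 TBW
TBW = 480 * 3,000 / (1,000 * 3) = 480 TBW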
1.5 SSD Selection Requirements

Performance and endurance are critical factors when selecting SSDs. For each of the above use cases, the amount and frequency of data written to the SSD or flash device determines the minimum performance and endurance required by ESXi. In general, SSDs can be deployed in all of the above use cases, but (low-end) flash devices, including SATADOM, can only be deployed in some. In the table below, ESXi write endurance requirements are stated in terms of Terabytes Written (TBW) for a JEDEC random workload. There are no specific ESXi performance requirements, but products built on top of ESXi, such as VSAN, may have their own requirements.

Table 1: SSD/Flash Endurance Requirements

1. For SSD sizes over 1 TB the endurance should grow proportionally (e.g., 7300 TBW for a 2 TB SSD).
2. Endurance requirement normalized to JEDEC random for an inherently sequential workload.
3. Only 4 GB of the device is used, so a 16 GB device need only support 25% as many P/E cycles.
4. The default coredump partition size is 2.7 GB. See Table 2 for detailed size requirements. When the boot and coredump devices are colocated(1) the boot device endurance requirement will suffice.
5. Failure of the ESXi boot and/or coredump devices is catastrophic for vSphere, hence the higher requirement as an extra margin of safety when the logging device is colocated(1) with one or both.
6. A future release of vSphere may require a higher TBW rating for its boot device. It is highly recommended that forward-looking systems provide a 2 TBW endurance rating for the vSphere boot device.

IMPORTANT: ALL of the TBW requirements in Table 1 are stated in terms of the JEDEC Enterprise Random Workload(2) because vendors commonly publish only a single endurance number, the random TBW. Vendors may provide a sequential number if asked, and such a number, together with a measured or worst-case WAF, can be used to calculate an alternative sequential TBW if the total workload writes in 5 years are known. Failure of the boot or coredump device is catastrophic for vSphere, so VMware requires use of the random TBW requirement for the boot and coredump use cases. Appendix 1 describes in detail how to do the calculation for the logging device use case.

(1) Colocated refers to the case where two use cases are partitions on the same device, thereby sharing flash cells.

(2) See JESD218A and JESD219 for the Endurance Test Method and Enterprise Workload definitions, respectively.

1.6 ESXi Coredump Device Usage Model

The size requirement for the ESXi coredump device scales with the size of the host DRAM and also with the usage of VSAN. vSphere ESXi installations with an available local datastore are advised to use dump-to-file, which automatically reserves the needed space on the local datastore, but flash media in general, and installations using VSAN in particular, will often lack a local datastore and thus require a coredump device. While the default size of 2560 MiB suffices for a host with 1 TiB of DRAM not running VSAN, if VSAN is in use the default size is very often insufficient. Table 2 gives the recommended partition size in MiB and the corresponding flash drive size recommendation. If these recommendations are ignored and ESXi crashes, the coredump may be truncated. The footnotes explain the calculation; note that if VSAN is in use the values from the right side of the table must be added to those from the left side. To override the default or change the coredump partition size after installation, consult the appendices.

Table 2: Coredump Partition Size Parameter and Size Requirement(4) as a Function of both Host DRAM Size and (if applicable) VSAN Caching Tier Size

1. 2560 is the default, so no parameter is required for systems without VSAN with up to 1 TiB of DRAM, or with VSAN with up to 512 GiB of DRAM and 250 GB of SSDs in the Caching Tier.
2. Due to GiB-to-GB conversion, 6 and 12 TiB DRAM sizes require the next larger flash device to accommodate the coredump partition. The provided sizes will also accommodate colocating the boot device and the coredump device on the same physical flash drive.
3. Sizes in these columns must be added to the sizes from the left-hand side of the table. For example, a host with 4 TiB of DRAM and 4 TB of SSD in the VSAN Caching Tier requires a flash device size of at least 24 GB (16 GB + 8 GB) and a coredump partition size of 15360 MiB (10240 + 5120).
4. Coredump device usage is very infrequent, so the TBW requirement is unchanged from Table 1.

1.7 VMware Support Policy

In general, if the SSD's host controller interface is supported by a certified IOVP driver, then the SSD is supported for ESXi provided that the media meets the endurance requirements above. Therefore, there are no specific vSphere restrictions against SATADOM and M.2 devices provided, again, that they adhere to the endurance requirements set forth in Table 1 above. For USB storage devices (such as flash drives, SD cards plus readers, and external disks of any kind), the drive vendor must work directly with system manufacturers to ensure that the drives are supported for those systems. USB flash devices and SD cards plus readers are qualified pairwise with USB host controllers, and it is possible for a device to fail certification with one host controller but pass with another. VMware strictly recommends that customers who do not have a preinstalled system either obtain a USB flash drive directly from their OEM vendor or purchase a model that has been certified for use with their server.

2. vSphere Flash Device Support - Appendices

Supplemental information.

2.1 Appendix 1: ESXi Logging Sequential Requirement As noted previously, a logging partition is created automatically when ESXi is installed on all non-usb media. No distinction is made as to whether this logging partition is located on a magnetic disk or on SSDs or flash, but when the partition is on flash care must be taken to ensure that device endurance is sufficient. Thus either the sequential workload endurance requirement derived below or the random workload requirement from Table 1 must be met. It cannot be stressed enough that this procedure can only be applied to sequential workloads such as the logging device workload where the worst case WAF is < 100 (block mode) and < 10 (page mode). WAF values for the JEDEC random workload are much larger so a device random TBW capability is a conservative indicator but it is one to which the raw device writes for a sequential workload can be compared directly without considering WAF. To apply this method, you must obtain from the flash vendor a theoretical sequential TBW and WAF. The raw workload has been measured to be 1.5 GB per hour or just over 64 TBW in 5 years without any write amplification. This number is not a worst case number and when the logging device is colocated with the boot and/or coredump device use cases VMware requires that a value of 128 TBW in 5 years shall be used as a worst case value, again without any WAF, to provide a margin of safety due to the catastrophic nature of device failure. These values are provided in Table 1 for comparison directly to the vendor published JEDEC random TBW but such a comparison is conservative due to the greater WAF for the JEDEC random workload. A more aggressive comparison can be done if the workload WAF for a proposed device has been measured under realistic load on an active ESXi cluster. For a flash device where the logging device WAF under load is measured to be <= 8 the sequential requirement for the logging device workload on a dedicated device is 64 TBW * 8 WAF = 512 TBW. When colocated with boot and/or coredump devices the requirement is 128 TBW * 8 WAF = 1024 TBW. Consult your flash vendor for assistance with measuring WAF or use the worst case WAF figures below. The logging partition is formatted with FAT16 which is similar enough to FAT32 so that most flash devices should handle it equally well. ESXi does not issue TRIM or SCSI UNMAP commands with FAT16. By volume of IO the workload is 75% small writes and 25% large writes(3) with 90% of IO write operations small writes which have a higher WAF, so we focus here on small writes. Once a log file is 1 MB in size it is compressed and deleted. The 8 most recent compressed log files of each type, of which there are at most 16, are retained, for a total of 16 uncompressed and 128 compressed log files. The small writes have an average size close to 128 bytes and are strictly sequential as they append consecutive lines of text to a log file. They write the same disk block (512 bytes) and flash cell repeatedly as lines are appended and they overwrite entire flash cells and likely erase blocks, significantly different from the JEDEC random workload. If VSAN is in use then the VSAN trace files will also be written to the logging partition and the trace file write load is included in the above TBW figure, but as noted the bulk of the writes are small and thus this discussion focuses on the small repeated writes of single disk blocks. 
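As a convenience, the comparison above can be restated as a short calculation; this is a minimal sketch, and the measured WAF of 8 and the colocated layout are illustrative assumptions:

    # Sequential endurance requirement for the ESXi logging device (sketch)
    RAW_TBW=128   # 128 TBW if colocated with boot and/or coredump, 64 TBW if dedicated
    WAF=8         # measured (or worst-case) logging-workload WAF for the candidate device
    echo "Required sequential endurance: $(( RAW_TBW * WAF )) TBW over 5 years"
    # Prints: Required sequential endurance: 1024 TBW over 5 years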
For a flash device operating in block mode, each write of a single disk block will consume an entire flash cell of 4K bytes. For a flash device operating in page mode, different writes of single disk blocks can share a flash cell, and each block will be written to a different disk block offset in a flash cell that has been previously erased and not subsequently reused.(4) For a block mode flash device to fill a flash cell of 4K bytes (8 disk blocks), the workload will write 32 disk blocks, so in block mode 32 flash cells will be written in the worst case. Similarly, with a flash cell size of 8K bytes, 64 writes will be required. Since 75% of the writes are small writes, the worst-case WAF is 24 for block mode flash devices with a flash cell size of 4K bytes and 48 with a flash cell size of 8K bytes, but this is only the WAF directly due to the workload. For page mode flash devices, each flash cell is filled densely, so 32 disk blocks will fit in 4 flash cells of 4K bytes and 64 disk blocks will fit in 4 flash cells of 8K bytes. A 5th write will likely be needed as part of garbage collection to copy over the final versions of the disk blocks, since the final disk blocks will be interleaved with overwritten ones. Again, 75% of 5 is about 4, so for a page mode flash device the worst-case WAF directly due to the workload is 4.

IMPORTANT: If a block mode flash device has an internal cache and can support up to 16 simultaneous IO streams, the direct WAF may be greatly reduced. Consult your flash vendor.

Once the workload-specific WAF for a proposed device has been determined, either by direct measurement or by using the appropriate worst-case value from above, the actual TBW for 5 years will be n * (WAF for the proposed device with the logging workload)(5), where n is 128 TBW if the logging device is colocated with the boot and/or coredump devices and 64 TBW if not. To find the scaling factor, divide the (WAF for the proposed device with the logging workload) by either 1.1 or the vendor-quoted theoretical sequential WAF. For example, 24 (the worst-case WAF for block mode with a 4K page size) divided by 1.1 is 22. Multiply the requirement (128 TBW if colocated, 64 TBW if not) by the scaling factor of 22. Since 128 * 22 ~= 2816 TBW, 2820 TBW is the adjusted theoretical sequential TBW for the colocated logging workload on this device. Verify that the requirement times the theoretical sequential WAF (2820 TBW * 1.1 WAF = 3102) exceeds the raw workload writes times the applicable WAF (128 TBW * 24 WAF = 3072). Compare this requirement to the vendor's theoretical sequential TBW value for the proposed device.

VMware does not currently support post-installation configuration of a logging partition on USB media but may in a future release (see Appendix 4).

(3) With VSAN in use the split is roughly 3 to 1 by volume of IOs. When VSAN is not in use there are fewer large IOs.
(4) Newer technologies may have additional factors not considered here. Please consult your flash vendor.
(5) By (WAF for the proposed device with the logging workload) is meant, in the absence of a measurement, either 24 or 48 for a block mode flash device with a page size of 4K bytes or 8K bytes respectively, or 4 for a page mode flash device. WAF values from actual measurement under realistic workload conditions are preferred.

2.2 Appendix 2: OEM Pre-Install Coredump Size

For USB factory pre-installs (also known as a dd-image) and also in a vSphere Auto Deploy environment, the coredump partition size can be provided on first boot with this syntax:

autopartitiondiskdumppartitionsize=5120

where 5120 is twice the default, in MiB. When preparing the dd-image for systems with large memory or intended for customers who will use VSAN, OEMs and partners should loopback-mount the dd-image, edit the boot.cfg file in the bootbank (partition 5), and add the autopartitiondiskdumppartitionsize option with a value to the ESXi boot line so that it will be parsed at first boot. No value need be specified for diskdumpslotsize, since the autopartitiondiskdumppartitionsize value will default the slot size to the same size, resulting in a larger coredump partition with a single slot. Although no further action is needed from the OEM, several points should be emphasized to customers:

Customers should choose to upgrade rather than install when upgrading to a newer version of vSphere ESXi, as installing will unconditionally replace the larger-than-default large coredump partition with a large coredump partition of the default size of 2560 MiB, undoing any first-boot work.
On subsequent boots the autopartitiondiskdumppartitionsize option has no effect, and thus it will not work if, for example, a customer deletes an existing coredump partition. Customers can remediate this situation using the steps in Appendix 3.

Customers who accidentally choose to install, or who otherwise need to manually resize their coredump partition, can remediate the situation using the steps in Appendix 3.

IMPORTANT: In current releases of vSphere, upgrade is NOT the default option when using the ISO installer. This issue will be addressed in a forthcoming release of vSphere.

IMPORTANT: In vSphere 5.5 and 6.0 an upgrade of USB factory pre-installs (aka the dd-image) may not be possible in the presence of a large coredump partition, regardless of its size (i.e., this issue exists even with a default-sized large coredump partition). The failure manifests in the ISO upgrade GUI with a message that vmkfstools has failed with an error. This issue is being resolved in patches to both the 5.5 and 6.0 releases of vSphere. Customers who encounter this issue and wish to upgrade should retry with the most up-to-date patch for their version of vSphere.

2.3 Appendix 3: Resizing Coredump Partitions

An ESXi instance can have multiple coredump partitions, but at any time only one can be active. By default an ESXi installation has two coredump partitions: a legacy partition of 110 MiB, which is adequate for operation in Maintenance Mode, and a large partition, 2.5 GiB by default, which is required when not in Maintenance Mode. While increasing the size of the large coredump partition, we make the legacy one active for safety. A coredump partition may hold multiple coredumps, with each coredump occupying a slot, but coredump partitions on local flash media are best configured with a single slot. After resizing, the slot size can be specified on the ESXi boot line with this syntax:

diskdumpslotsize=5120

where 5120 MiB is twice the default; the value should match your enlargement.

IMPORTANT: To manually resize the coredump partition, place the host in Maintenance Mode.

IMPORTANT: After resizing the coredump partition, the ESXi host must be rebooted with the appropriate diskdumpslotsize specified to finish the resizing operation.

Is the boot media coredump partition active?

Starting with vSphere 5.5, ESXi images have a large coredump partition of 2.5 GiB (2.7 GB), but on USB media in particular it may not be present, and if present it may not be active. Consider, for example, a host with no active coredump partition on the USB flash boot media: in this case, with a pre-install dd-image on a USB flash drive, a coredump partition on a local SCSI disk was available and used, and no coredump partition was created on first boot. If the local SCSI disk is later used for VSAN or removed altogether, a coredump partition will not be created. The next section gives a procedure for creating a large coredump partition when none exists.

Creating a coredump partition on USB boot media

IMPORTANT: This procedure is only applicable to vSphere ESXi 5.5 and later on USB boot media with the MBR partition table type. This is the default for OEM pre-install dd-images.

To check whether the partition table type of the boot device is MBR (aka msdos), use the command shown in the sketch below.
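The following is a minimal sketch of the checks described in this section; the device name mpx.vmhba32:C0:T0:L0 stands in for the actual boot device and is hypothetical:

    # Is a coredump partition configured and active?
    esxcli system coredump partition get
    esxcli system coredump partition list

    # Check the partition table type of the boot device; the first line of the
    # output is the partition table type, either msdos (MBR) or gpt.
    partedUtil getptbl /vmfs/devices/disks/mpx.vmhba32:C0:T0:L0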

Note that the previous section gives commands to determine the boot device. Customers who for whatever reason (e.g., an upgrade from an install image of a previous version of vSphere ESXi) have a GPT partition table type on their boot device and lack a large coredump partition should contact VMware support for assistance.

As noted above, customers with OEM pre-install dd-images may not have a large coredump partition. If the boot device partition table type is MBR, then fdisk can be used to create a coredump partition. The default units of fdisk on ESXi are cylinders, so use the u command to switch units to sectors before using the n command to create a primary partition with number 2, a default start sector, and an end sector of +5242879, which is one less than twice the default size in KiB, since a 512-byte disk sector is half of 1 KiB (2560 MiB == 2621440 KiB == 5242880 sectors). Next use the t command to set the partition type to fc and then write out the partition table with the w command. Verify the partition creation and activate the new coredump partition (see the sketch below).

IMPORTANT: After creation, reboot and specify diskdumpslotsize to ensure the correct slot size.

Resizing a coredump partition on USB boot media

As noted above, the default coredump partition size may not be sufficient. Table 2 provides the needed size for various system configurations. partedUtil can be used on both MBR and GPT partition table types, so this procedure can be used on all ESXi install types, but for OEM pre-install dd-images the coredump partition will be partition number 2 due to limitations of MBR. Here we increase the coredump partition size from 2560 MiB to 15360 MiB for a host with 4 TiB of DRAM and 4 TB of SSDs in the VSAN Caching Tier and a 32 GB USB flash (or SD card) boot device. This example is for an install image; for a dd-image substitute partition 2 for partition 9.
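A minimal sketch of both the verification/activation step and the resizing step follows; the device name mpx.vmhba32:C0:T0:L0 is hypothetical, and the legacy partition number, start sector, and new end sector must be taken from the actual partedUtil getptbl output:

    # Verify that the newly created partition exists (type fc = vmkDiagnostic)
    partedUtil getptbl /vmfs/devices/disks/mpx.vmhba32:C0:T0:L0

    # Activate the new coredump partition (partition 2 in the creation example above)
    esxcli system coredump partition set --partition=mpx.vmhba32:C0:T0:L0:2
    esxcli system coredump partition set --enable=true

    # To grow the existing large coredump partition (partition 9 on an install image),
    # first make the legacy 110 MiB coredump partition active for safety, then resize.
    # 15360 MiB is 31457280 sectors of 512 bytes, so the new end sector is the
    # partition's current start sector plus 31457280 - 1.
    esxcli system coredump partition set --partition=mpx.vmhba32:C0:T0:L0:<legacyPartition>
    partedUtil resize /vmfs/devices/disks/mpx.vmhba32:C0:T0:L0 9 <startSector> <newEndSector>
    esxcli system coredump partition set --partition=mpx.vmhba32:C0:T0:L0:9
    esxcli system coredump partition set --enable=true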

IMPORTANT: After resizing, reboot and specify diskdumpslotsize to ensure the correct slot size.

Resizing a coredump partition on other boot media

Except for installations to USB devices, a vSphere ESXi install device will have a partition for logging and, unless the flash device is small, a datastore partition as well. If the latter is present, it may be necessary to delete or resize the datastore partition to free up space. The logging partition may be relocated or deleted depending upon requirements. Here we delete the datastore partition and relocate the logging partition for an ESXi installation on a local SCSI disk (naa.0123456789abcdef). We begin by repointing /scratch at /tmp, since we will be relocating the underlying partition.
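A minimal sketch of these first steps follows; the datastore partition number is an assumption and must be confirmed against the partedUtil getptbl output, and the VMFS datastore itself should be evacuated and unmounted or removed before its partition is deleted:

    # Repoint /scratch at /tmp (takes effect after a reboot)
    vim-cmd hostsvc/advopt/update ScratchConfig.ConfiguredScratchLocation string /tmp

    # Inspect the partition layout of the install disk
    partedUtil getptbl /vmfs/devices/disks/naa.0123456789abcdef

    # Delete the datastore partition to free space (partition 3 is assumed here)
    partedUtil delete /vmfs/devices/disks/naa.0123456789abcdef 3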

Now that space is available, the coredump partition can be grown using the technique described in the previous section ("Creating a coredump partition on USB boot media"). Reconfiguration and a reboot are required to use the relocated logging partition; for details see the VMware KB article Creating a persistent scratch location for ESXi 4.x/5.x/6.x (1033696). A reboot is also required to use a resized coredump partition, and the reboots can be combined into one at the end of all the partition manipulation described here.

2.4 Appendix 4: Creating a Logging Partition

IMPORTANT: This procedure is only applicable to vSphere ESXi 5.5 and later on USB boot media with the MBR partition table type. To check the partition table type on an ESXi installation, see the section "Creating a coredump partition on USB boot media" in Appendix 3. It will be useful if the reader studies that section in detail before continuing here.

IMPORTANT: The USB flash device must meet the endurance requirement set forth in Table 1. VMware strictly recommends that the USB flash device be provided by the host system OEM and certified by the OEM for the logging device usage model. See VMware Support Policy above.

The default units of fdisk on ESXi are cylinders, so use the u command to switch units to sectors before using the n command to create a primary partition with number 3, a default start sector, and an end sector of +8386559, which is one less than the maximum number of 512-byte sectors supported in a FAT16 volume. Next use the t command to set the partition type to 6 and then write out the partition table with the w command. Verify and format the new partition with these commands:
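A minimal sketch of the fdisk sequence and a verification check follows; the device name mpx.vmhba32:C0:T0:L0 is a hypothetical USB boot device, the interactive fdisk responses are summarized as comments, and the formatting of the new FAT16 volume is not shown:

    fdisk /dev/disks/mpx.vmhba32:C0:T0:L0
    #   u            switch units from cylinders to sectors
    #   n, p, 3      new primary partition, number 3
    #   <Enter>      accept the default start sector
    #   +8386559     end sector
    #   t, 3, 6      set partition 3 to type 6 (FAT16)
    #   w            write the partition table and exit

    # Verify the new partition
    partedUtil getptbl /vmfs/devices/disks/mpx.vmhba32:C0:T0:L0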

Reconfiguration and a reboot are required to use the newly created partition; for details see the VMware Knowledge Base article Creating a persistent scratch location for ESXi 4.x/5.x/6.x (1033696).