STO1926BU: A Day in the Life of a vSAN I/O. Diving into the I/O Flow of vSAN
John Nicholson (@lost_signal), Pete Koehler (@vmpete)
VMworld 2017 Content: Not for publication
#VMworld #STO1926BU
Disclaimer
This presentation may contain product features that are currently under development. This overview of new technology represents no commitment from VMware to deliver these features in any generally available product. Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind. Technical feasibility and market demand will affect final delivery. Pricing and packaging for any new technologies or features discussed or presented have not been determined.
vSAN Objects and Components
vSAN Objects and Components
- The vSAN datastore is an object store
- The object store allows you to meet granular availability and performance requirements
- Each object is made up of one or more components
- Data (components) is distributed across the cluster based on the VM storage policy
Diagram: a 700GB object with RAID-1 (FTT=1) across two copies, each copy a RAID-0 of components (C1, C2, C3), plus a witness (W).
Virtual Machine as a Set of Objects on vSAN
A VM (with a snapshot) consists of the following objects:
- VM Home Namespace object
- VM Swap object
- Virtual Disk (VMDK) object
- Snapshot (delta) object
- Snapshot memory object
Applying Performance and Protection Policies to Objects
- Policies define levels of protection and performance
- Applied at a per-VM level, or per-VMDK level
- vSAN currently provides 10 unique storage capabilities to vCenter Server
- "What If" APIs help preview the impact of a policy before it is applied
Number of Failures to Tolerate (How many copies of your data?)
- FTT defines the number of host, disk, or network failures a storage object can tolerate
- For n failures tolerated, n+1 copies of the object are created, and 2n+1 hosts contributing storage are required
- Primary Failures to Tolerate (PFTT) defines the number of sites that can fail (0, 1)
- Secondary Failures to Tolerate (SFTT) defines the number of failures tolerated within a site (0, 1, 2, 3)
Diagram: RAID-1 with FTT=1 across esxi-01 through esxi-04 plus a witness; each replica serves ~50% of I/O.
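The n+1 copies / 2n+1 hosts rule above can be expressed as a small helper. This is illustrative arithmetic only, not vSAN code; the function name is made up for this sketch.

```python
def ftt_requirements(n: int) -> dict:
    """For FTT=n with RAID-1 mirroring: n+1 full copies of the object,
    and 2n+1 hosts contributing storage (the extra hosts hold witness
    components used to break ties during failures)."""
    if n < 0:
        raise ValueError("FTT must be non-negative")
    return {"copies": n + 1, "min_hosts": 2 * n + 1}

# FTT=1: two mirrored copies plus a witness, spread over three hosts
print(ftt_requirements(1))  # {'copies': 2, 'min_hosts': 3}
```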
Number of Disk Stripes Per Object (On how many devices?)
- Defines the minimum number of capacity devices across which each replica of a storage object is distributed
- Higher values may result in better performance: stripe width can improve write destaging and the fetching of reads
- Higher values may put more constraints on the flexibility of meeting storage compliance policies
- To be used only if performance is an issue
Diagram: RAID-1 with FTT=1 and stripe width=2; each replica is a RAID-0 striped across two devices (stripe-1a/1b, stripe-2a/2b) on esxi-01 through esxi-04 plus a witness.
vSAN Fault Domains
- Create fault domains to increase availability; protect against rack failure, etc.
- Example: four defined fault domains, one per rack
  FD1 = esxi-01, esxi-02
  FD2 = esxi-03, esxi-04
  FD3 = esxi-05, esxi-06
  FD4 = esxi-07, esxi-08
- The cluster can tolerate a single rack failure in the illustrated scenario
Diagram: RAID-1, FTT=1, with replicas and witness placed in separate fault domains.
Nested Fault Domains: Remote Protection for Stretched Clusters
- Redundancy locally (RAID-6 within each site) and across sites (RAID-1); 5ms RTT, 10GbE between sites; 3rd site for the witness
- With a site failure, vSAN maintains availability with local redundancy in the surviving site
- No change in stretched cluster configuration steps
- Optimized site locality logic minimizes I/O traffic across sites
vSAN I/O Path Explained
Anatomy of a Write
1. Guest OS issues a write op to the virtual disk
2. Owner clones the write operation per the FTT policy
3. esxi-01 and esxi-03 synchronously write to flash (log)
4. esxi-01 and esxi-03 ACK the prepare operation to the owner
5. Owner receives ACKs from both prepare operations and completes the I/O
6. Batches of writes are committed during the destaging process
Diagram: RAID-1, FTT=1, replicas on esxi-01 and esxi-03.
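The six steps above amount to a prepare/ACK flow followed by a deferred batch commit. A minimal sketch (illustrative only, not vSAN source; class and host names are made up):

```python
class ReplicaHost:
    """Toy model of a host holding one replica of an object."""
    def __init__(self, name):
        self.name = name
        self.write_log = []   # flash cache tier (write log)
        self.capacity = {}    # capacity tier

    def prepare(self, lba, data):
        self.write_log.append((lba, data))  # synchronous write to flash log
        return True                         # ACK the prepare to the owner

    def destage(self):
        # Batches of logged writes are committed to capacity later (step 6)
        for lba, data in self.write_log:
            self.capacity[lba] = data
        self.write_log.clear()

def owner_write(replicas, lba, data):
    # Owner clones the write to every replica required by the FTT policy
    acks = [r.prepare(lba, data) for r in replicas]
    return all(acks)  # guest I/O completes only after ALL replicas ACK

hosts = [ReplicaHost("esxi-01"), ReplicaHost("esxi-03")]
assert owner_write(hosts, lba=42, data=b"payload")
for h in hosts:
    h.destage()
```

The key property the sketch shows: latency of the guest write is bounded by the slowest replica's flash-log write, while the capacity-tier commit happens asynchronously.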
Anatomy of a Read (Hybrid)
1. Guest OS issues a read on the virtual disk
2. Owner chooses a replica to read from: load-balances across replicas; not necessarily the local replica (if one exists); a given block always reads from the same replica
3. At the chosen replica (esxi-03): read data from the flash read cache, if present
4. Otherwise, read from HDD and place the data in the flash read cache (replacing cold data)
5. Return data to the owner
6. Complete the read and return data to the VM
Anatomy of a Read (All-Flash)
1. Guest OS issues a read on the virtual disk
2. Owner chooses a replica to read from: load-balances across replicas; not necessarily the local replica (if one exists); a given block always reads from the same replica
3. At the chosen replica (esxi-03): read data from the (write) flash cache, if present
4. Otherwise, read from the capacity flash device
5. Return data to the owner
6. Complete the read and return data to the VM
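The "a block always reads from the same replica" behavior in step 2 implies a deterministic block-to-replica mapping, which load-balances reads without thrashing per-replica caches. A sketch of one way to get that property (the hash choice here is illustrative, not vSAN's actual algorithm):

```python
import zlib

def choose_replica(lba: int, replicas: list) -> str:
    """Deterministically map a logical block address to one replica.
    The same block always lands on the same copy, while different
    blocks spread across all replicas."""
    key = zlib.crc32(lba.to_bytes(8, "little"))
    return replicas[key % len(replicas)]

replicas = ["esxi-01", "esxi-03"]
# Same block -> same replica, every time
assert choose_replica(7, replicas) == choose_replica(7, replicas)
# A range of blocks spreads across both replicas
targets = {choose_replica(lba, replicas) for lba in range(100)}
assert targets == set(replicas)
```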
vSAN Caching
- vSAN caches based on frequency of data access and spatial locality
- Smart data locality: improved flash utilization in the cluster; avoids data migration with VM migration (e.g. DRS); minor latency penalty
- Network latencies: 5-50 microseconds (10GbE); flash latencies with real load: ~1 millisecond
- vSAN supports an in-memory local cache ("client cache"): read caching using host RAM (0.4% of host RAM, up to 1GB per host); very low latency; complements CRBC
Chart: orders per minute (5-minute moving average) across a vMotion, showing consistent performance throughout.
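The client cache sizing rule (0.4% of host RAM, capped at 1GB per host) is simple arithmetic; a quick helper to see where the cap kicks in (illustrative only; the function name is made up):

```python
def client_cache_gb(host_ram_gb: float) -> float:
    """Per-host in-memory read cache size: 0.4% of host RAM,
    capped at 1 GB per host."""
    return min(host_ram_gb * 0.004, 1.0)

print(client_cache_gb(128))  # a 128 GB host gets a 0.512 GB cache
print(client_cache_gb(512))  # capped at 1.0 GB (256 GB RAM and above)
```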
Checksum and Disk Scrubbing
- Detects and resolves silent disk errors; checks data in flight and at rest
- Upon checksum verification failure: RAID-1 fetches from the other copy; RAID-5/6 rebuilds the data
- Disk scrubbing runs in the background
- Dramatic performance improvements of checksum in 6.6: 25%-73% more IOPS*; 21%-44% reduction in latency*
Diagram: RAID-1 replicas on esxi-01 and esxi-03.
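The detect-and-repair flow for a RAID-1 object can be sketched as follows. CRC32 here is a stand-in for vSAN's actual checksum algorithm, and the store/mirror dictionaries are toy models, purely for illustration:

```python
import zlib

BLOCK = 4096  # checksums are computed per 4 KB block

def write_block(store, lba, data):
    store[lba] = (data, zlib.crc32(data))  # checksum stored with the data

def read_block(store, mirror, lba):
    data, crc = store[lba]
    if zlib.crc32(data) != crc:   # silent corruption detected on read
        data, crc = mirror[lba]   # RAID-1: fetch from the other copy
        store[lba] = (data, crc)  # ...and repair the bad replica in place
    return data

primary, mirror = {}, {}
payload = b"x" * BLOCK
write_block(primary, 0, payload)
write_block(mirror, 0, payload)
primary[0] = (b"y" * BLOCK, primary[0][1])  # simulate bit rot on one copy
assert read_block(primary, mirror, 0) == payload  # read still succeeds
assert primary[0][0] == payload                   # replica was repaired
```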
Deduplication and Compression: How It Works (All-Flash Only, Beta)
- Nearline deduplication occurs at the per-disk-group level
- Data is deduplicated when destaging from the cache tier to the capacity tier
- Deduplication uses 4KB fixed blocks for high dedup rates
- Compression occurs after deduplication, prior to data being destaged: if the compressed block is <= 2KB it is stored compressed, otherwise the full 4KB block is stored
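The destage pipeline described above (fixed 4KB blocks, dedup first, then compress, keep the compressed form only if it fits in 2KB) can be sketched like this. It is an illustrative model, not vSAN code: SHA-256 stands in for the fingerprint, and zlib stands in for LZ4.

```python
import hashlib
import zlib

BLOCK = 4096
store, dedup_index = {}, {}   # capacity tier; fingerprint -> block address

def destage(block: bytes) -> str:
    """Destage one 4 KB block from the cache tier to the capacity tier."""
    assert len(block) == BLOCK
    fp = hashlib.sha256(block).hexdigest()
    if fp in dedup_index:
        return dedup_index[fp]        # duplicate: reference it, no new write
    packed = zlib.compress(block)     # compression happens AFTER dedup
    addr = f"blk-{len(store)}"
    # Store compressed only if it fits in 2 KB, else the full 4 KB block
    store[addr] = packed if len(packed) <= 2048 else block
    dedup_index[fp] = addr
    return addr

a = destage(b"\x00" * BLOCK)
b = destage(b"\x00" * BLOCK)
assert a == b          # the second identical block was deduplicated
assert len(store) == 1 # only one physical copy on the capacity tier
```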
Deduplication and Compression: Disk Group Stripes (All-Flash Only, Beta)
- Deduplication and compression operate at the per-disk-group level
- Data stripes across the disk group; the fault domain is isolated to the disk group
- Failure of a device leads to a rebuild of the disk group
- Stripes reduce hotspots (endurance/throughput impact)
Deduplication and Compression: I/O Path (All-Flash Only)
1. VM issues write
2. VM issues write
3. Cold data to memory
4. Deduplication
5. Compression
6. Data written to capacity
- Avoids inline and post-process downsides
- Performed at the disk group level
- 4KB fixed block; LZ4 compression after deduplication
Costs of Deduplication: Nothing Is Free (All-Flash Only)
- CPU overhead
- Metadata and memory overhead
- I/O overhead: metadata lookup; data movement from the write buffer; fragmentation
- Endurance overhead
Costs of Compression: Nothing Is Free (All-Flash Only)
- CPU overhead
- Capacity overhead
- Memory overhead
- I/O overhead
Erasure Coding: RAID-5 and RAID-6 (All-Flash Only)
- Alternative to RAID-1 mirroring
- Guaranteed space efficiency; available in all-flash configurations only
- Object is comprised of components that are striped across devices
- Set per object using SPBM policy
- RAID-5 implies a failures to tolerate (FTT) of 1; RAID-6 implies an FTT of 2
Diagram: RAID-5 object with components C1-C4 across esxi-01 through esxi-04.
Data Layout for RAID-5 (All-Flash Only)
- Available in all-flash configurations only
- Example: FTT=1 with FTM=RAID-5, laid out 3+1 across four ESXi hosts, with parity rotating across hosts (4-host minimum; 1 host can fail without data loss)
- 5 hosts would tolerate 1 host failure or maintenance-mode state and still maintain redundancy
- 1.33x instead of 2x overhead, ~33% savings (a 20GB disk consumes 40GB with RAID-1 vs ~27GB with RAID-5)
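The overhead comparison on this slide is worth checking with a quick calculation (illustrative helper; the scheme names are made up labels for this sketch):

```python
def raw_consumed_gb(data_gb: float, scheme: str) -> float:
    """Raw capacity consumed for a given amount of data.
    RAID-1 FTT=1 doubles it; RAID-5 (3 data + 1 parity) costs 4/3;
    RAID-6 (4 data + 2 parity) costs 3/2."""
    factors = {"raid1-ftt1": 2.0, "raid5": 4 / 3, "raid6": 1.5}
    return data_gb * factors[scheme]

print(raw_consumed_gb(20, "raid1-ftt1"))  # 40.0 GB
print(round(raw_consumed_gb(20, "raid5"), 1))  # ~26.7 GB, a ~33% saving
```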
Data-at-Rest Encryption: Ingesting Writes
Encryption occurs in the last step of the I/O flow, for the highest level of protection and efficiency of dedup. Data in flight is not encrypted.
Incoming to buffer:
1. Write I/O broken into 64K chunks
2. Checksum performed on 4K blocks
3. Encryption performed on 4K blocks
4. Lands in buffer
Destaging:
5. Decryption performed on 4K blocks
6. Dedupe performed on 4K blocks
7. Compression performed on 4K blocks
8. Encryption performed on 2-4K blocks
9. Lands in persistent tier
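Steps 5-9 explain why encryption comes last: dedup and compression only work on plaintext, so destaging must decrypt, transform, then re-encrypt. A toy sketch of that ordering (the XOR "cipher" is NOT real cryptography, just a placeholder so the flow is runnable; hashing and compression stand in for vSAN's internals):

```python
import hashlib
import zlib

KEY = 0x5A

def xor_cipher(data: bytes) -> bytes:
    """Toy reversible 'cipher' standing in for real disk encryption."""
    return bytes(b ^ KEY for b in data)

def destage_encrypted(log_block: bytes, dedup_index: dict) -> bytes:
    plain = xor_cipher(log_block)            # 5. decrypt from the write log
    fp = hashlib.sha256(plain).hexdigest()   # 6. dedupe on PLAINTEXT
    if fp in dedup_index:
        return dedup_index[fp]               # duplicate: no new write
    packed = zlib.compress(plain)            # 7. compress
    stored = xor_cipher(packed)              # 8. re-encrypt the result
    dedup_index[fp] = stored                 # 9. land on the capacity tier
    return stored

idx = {}
blk = xor_cipher(b"A" * 4096)                # block as it sits in the log
assert destage_encrypted(blk, idx) == destage_encrypted(blk, idx)
```

If encryption happened before dedup instead, identical plaintext blocks would produce different ciphertexts and the dedup index would never find a match.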
Swap Placement: Sparse Swap
- Reclaims space used by memory swap
- Enabled via a host advanced option: /VSAN/SwapThickProvisionDisabled
- To check the current value: esxcfg-advcfg -g /VSAN/SwapThickProvisionDisabled
- https://github.com/jasemccarty/sparseswap
Snapshots for vSAN
- Not using VMFS redo logs
- Writes allocated into 4MB allocations
- Snapshot metadata cache (avoids read amplification); performs pre-fetch of the metadata cache
- Maximum: 31 snapshots
vSAN Back End Storage I/O Explained
vSAN Storage Traffic Types
- Front end storage traffic: guest VM storage I/O
- Back end storage traffic: data resynchronizations; object policy changes; host or disk group evacuations; object or component rebalancing; object or component repairs
Repairs and Rebuilds
- Occur at the granular component level
- Reestablish the level of protection compliance defined in the SPBM policy
- The repair process begins 60 minutes after a component is reported as absent
- Works in non-stretched and stretched clusters
Diagram: 700GB object, RAID-1, FTT=1, components C1-C3 with witness (W).
Intelligent Rebuilds: Enhanced Rebalancing
- Larger components can be split during redistribution: better balance, higher level of effective capacity
- Improved placement decisions reduce overhead; faster time to completion
- Can be manually throttled for corner-case scenarios
- Improved visibility into rebalancing status in the Health and Performance services
Example: disk capacity used per host before reactive rebalance: 70%, 70%, 85%, 60%; after rebalance with a component split (C3 split into C3 and C4): 70%, 70%, 75%, 70%.
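A reactive rebalance of the kind shown in the example can be modeled as: when a host crosses a capacity threshold, split a component and migrate part of it to the least-loaded host. This is a deliberately simplified sketch (the 80% threshold and the "move half the difference" rule are assumptions for illustration, not vSAN's actual heuristics):

```python
def rebalance(usage: dict, threshold: float = 0.80) -> dict:
    """One reactive-rebalance step over per-host capacity fractions.
    If the hottest host exceeds the threshold, split off part of a
    component and move it to the coldest host."""
    hot = max(usage, key=usage.get)
    if usage[hot] < threshold:
        return usage               # nothing to do
    cold = min(usage, key=usage.get)
    moved = (usage[hot] - usage[cold]) / 2   # split: migrate half the gap
    usage[hot] -= moved
    usage[cold] += moved
    return usage

usage = {"esxi-01": 0.70, "esxi-02": 0.70, "esxi-03": 0.85, "esxi-04": 0.60}
rebalance(usage)
# The hot host and the cold host meet in the middle at ~72.5% each
assert round(usage["esxi-03"], 3) == 0.725
assert round(usage["esxi-04"], 3) == 0.725
```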
Intelligent Rebuilds: Smart, Efficient Repairs
- Two methods exist for repairing offline components that reappear after 60 minutes
- vSAN calculates the cost of each repair method at the time the host comes back online
- It chooses the most efficient (fastest) method and cancels the other action
- Significant improvement in the speed and efficiency of component repairs
Diagram: 700GB object, RAID-1, FTT=1, components C1-C3 with witness (W).
Intelligent Rebuilds Using Partial Repairs
Status: two host failures. Configured policy: FTT=2. Effective compliance: FTT=2.
- More resilient repair process
- Repairs as many components as possible, even if there are not enough resources to ensure full compliance
- Remaining components are repaired as soon as enough resources are available
- Works in non-stretched and stretched clusters
Intelligent Rebuilds Using Partial Repairs (continued)
Status: two host failures. Configured policy: FTT=2. Effective compliance: FTT=0.
Intelligent Rebuilds Using Partial Repairs (continued)
Status: partial repair completed. Configured policy: FTT=2. Effective compliance: FTT=1.
Intelligent Rebuilds Using Partial Repairs (continued)
Status: new host added, full repair completed. Configured policy: FTT=2. Effective compliance: FTT=2.
Wrapping Up
www.vspeakingpodcast.com @vpedroarrow @Lost_Signal