SQL Anywhere I/O Requirements for Windows and Linux
A whitepaper from Sybase, an SAP Company
28 March 2011

Contents

1 Introduction
2 The Need for Reliable Storage
   2.1 Recovery in SQL Anywhere
   2.2 Durability of Transactions
3 The Storage Stack
   3.1 High Level Overview
   3.2 Disk Drives
       3.2.1 Windows
       3.2.2 Linux
   3.3 Storage Controllers
       3.3.1 Windows
       3.3.2 Linux
   3.4 Storage Drivers
       3.4.1 Windows
       3.4.2 Linux
   3.5 Filesystems
       3.5.1 Windows
       3.5.2 Linux
4 System Configuration
   4.1 Windows
       4.1.1 Disabling the write cache
   4.2 Linux
       4.2.1 Determining the device name
       4.2.2 Disabling the write cache
5 Conclusions
   5.1 Windows
   5.2 Linux

1 Introduction

Database servers need to be able to guarantee that data gets to stable storage, both to ensure that committed transactions persist and to properly implement recovery in case of power loss. Operating systems and storage devices cache and reorder write operations in order to improve performance. Running a database server on an improperly configured system can therefore lead to data loss and file corruption. This document aims to provide the background necessary to understand the durable storage requirements and I/O semantics of SQL Anywhere.

2 The Need for Reliable Storage

2.1 Recovery in SQL Anywhere

After an abnormal shutdown caused by a database server crash, an operating system crash, or a loss of power, SQL Anywhere must perform recovery before a database file can be used. The recovery strategy consists of two phases. In the first phase, pages within the database file are reverted to the state they were in at the time of the last checkpoint. In the second phase, any transactions committed since the last checkpoint are replayed according to the operations recorded in the transaction log.

Successful recovery depends on the database server's actions during normal operation. Under normal operation, special action is taken the first time a page in the database file is modified after the last checkpoint. Before the database server can allow any modifications to occur, it must save a copy of the page. This unmodified copy is called a pre-image and is stored within the database file itself. The collection of all pre-images since the last checkpoint forms the checkpoint log. Once a checkpoint completes, the checkpoint log is discarded and the process repeats. Should the database server terminate abnormally, the checkpoint log provides the mechanism needed to perform the first phase of recovery.

SQL Anywhere cannot allow a modified page to be written to disk before its pre-image, or recovery would be impossible. To enforce this requirement, the database server issues an operating-system-provided flush operation after writing the pre-image, and it does not write the newly modified page until the flush operation completes. To ensure the robustness of SQL Anywhere's recovery scheme, the flush operation provided by the operating system must guarantee that any write operations issued prior to the flush are on stable storage when the flush operation completes.
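To make the ordering concrete, here is a minimal sketch of the "pre-image first, then flush, then modified page" rule using POSIX calls. The function name, page size, and offsets are hypothetical illustrations of the requirement described above, not SQL Anywhere's actual implementation.

#define _XOPEN_SOURCE 500
#include <sys/types.h>
#include <unistd.h>

#define PAGE_SIZE 4096

/* Illustrative only: a modified page may be written only after its
 * pre-image is known to be on stable storage. */
int write_page_with_preimage(int fd, const char *pre_image, off_t preimage_off,
                             const char *new_page, off_t page_off)
{
    /* 1. Save the unmodified copy (pre-image) into the checkpoint log. */
    if (pwrite(fd, pre_image, PAGE_SIZE, preimage_off) != PAGE_SIZE)
        return -1;

    /* 2. Flush: the OS must guarantee the pre-image is on stable storage
     *    when this call returns. */
    if (fsync(fd) != 0)
        return -1;

    /* 3. Only now is it safe to overwrite the page itself. */
    if (pwrite(fd, new_page, PAGE_SIZE, page_off) != PAGE_SIZE)
        return -1;

    return 0;
}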

To illustrate the point, suppose that a row was deleted from a table, that the updated table page made it out to disk before its pre-image, and that an operating system crash occurred at that instant. During recovery, all of the other pre-images would be applied, reverting most of the database, but not the page that was just updated. Recovery requires that any operation logged in the transaction log will succeed when replayed, since for that statement to have been recorded it must have succeeded initially. In this case, applying the transaction log would result in a delete statement attempting to delete the row and its index entries. While the index entries would be present (since their pages were rolled back), the row on the table page would not be, and the delete would fail. This inconsistency causes recovery to fail.

Not only must the database writes be flushed to disk, but at certain times the file metadata must be flushed to disk as well. This file metadata (the filesystem's information about the file) is also physically stored on disk, and includes details such as where the file is stored on disk and the size of the file. SQL Anywhere issues a flush request at any point where the metadata must be on stable media.

Here is an example of why it is important that the file metadata is also written to stable media. Suppose that the database file is full (no free pages) and a new page needs to be allocated. SQL Anywhere needs to grow the file to store this data. Assume that the file is grown by some number of pages and that some page in the database now refers to data on one of these new pages. Even though the write of the new page makes it to stable media, there is still a potential problem if power to the drive is lost. Such a power loss could result in a shorter database file upon recovery: if the filesystem information containing the updated file size was not committed to disk, the file will revert to the length that was last stored, leaving data that refers to pages beyond the end of the file. SQL Anywhere avoids this situation by issuing a flush request before any write that references the new pages, and it waits for the flush request to finish before performing any writes that could leave the database in an inconsistent state in the event of a power loss.
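The same write-then-flush discipline covers the file-growth case. The sketch below, again with illustrative names and offsets rather than SQL Anywhere's actual code, shows why the flush must complete before anything references the newly allocated pages:

#define _XOPEN_SOURCE 500
#include <sys/types.h>
#include <unistd.h>

#define PAGE_SIZE 4096

/* Illustrative only: grow the file, flush so the new page and the updated
 * file size (metadata) reach stable storage, and only then write a page
 * that refers to the newly allocated page. */
int grow_then_reference(int fd, off_t old_size,
                        const char *referencing_page, off_t referencing_off)
{
    char zero_page[PAGE_SIZE] = {0};

    /* 1. Extend the file by one page. */
    if (pwrite(fd, zero_page, PAGE_SIZE, old_size) != PAGE_SIZE)
        return -1;

    /* 2. Flush: otherwise a power loss could leave a shorter file with
     *    references to pages beyond its end. */
    if (fsync(fd) != 0)
        return -1;

    /* 3. Only now write the page that refers to the new page. */
    if (pwrite(fd, referencing_page, PAGE_SIZE, referencing_off) != PAGE_SIZE)
        return -1;

    return 0;
}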

2.2 Durability of Transactions

SQL Anywhere, as an ACID (atomicity, consistency, isolation, durability) compliant database, requires that transactions are durable once they are successfully committed. This means that even in the event of power loss, the effects of a transaction persist once the database is brought back up. For this reason, every time a commit is issued, SQL Anywhere requires that the transaction is physically written to the transaction log on disk. By replaying the operations in the transaction log that occurred after the most recent checkpoint, SQL Anywhere can implement the second phase of recovery and safely restore the database to the same state it was in before the power loss.

When a transaction commits, SQL Anywhere uses a combination of flush operations and an I/O mechanism known as writethrough (discussed in the next section) to ensure that the operations of the entire transaction have been recorded to the transaction log on stable storage. If the operating system cannot guarantee the reliability of these operations, SQL Anywhere cannot guarantee the durability of transactions.

3 The Storage Stack

3.1 High Level Overview

An operating system's storage stack is made up of a number of layers. A misconfiguration in any one of these layers can cause the entire stack to become unreliable. A simplified view of a storage stack, from top to bottom, is given below:

Filesystem
Storage Driver
Storage Controller
Disk Drive

SQL Anywhere databases, transaction logs, and dbspaces are simply regular files on a filesystem. The filesystem is provided by the operating system and is responsible for turning operations on files and directories into I/O requests that can be issued to the storage driver. The storage driver then forwards its requests directly to the storage controller (hardware), which finally passes the requests down to the disk drive. The following sections examine how each of these layers affects the reliability of the storage stack, starting from the bottom and moving up.

3.2 Disk Drives

Disk drives are non-volatile storage, meaning that information written to the disk persists when power is removed. Disk drives have traditionally been implemented with spinning media, but Solid State Drives (SSDs) are becoming increasingly common. Regardless of the technology in use, modern disks almost always employ a fast (but volatile) DRAM buffer as a write cache to improve performance.

For rotational disks, the write cache allows the drive to delay and reorder I/Os to mitigate the natural delays introduced by waiting for the platter to spin to the needed location. For SSDs, a write cache is employed to decrease the performance impact of rewrites caused by remapping OS I/Os into the large, block-sized writes used internally by the SSD. In either case, the drive reports that an I/O has completed as soon as it has been successfully stored in the write cache. If power is lost while an I/O is sitting in the write cache, before it is written to the physical media, the I/O is lost.

Since the disk reports that an I/O has completed once it is stored in the volatile write cache, the higher levels in the storage stack need a way to guarantee that a given piece of data really is on the non-volatile medium. Both the SCSI and ATA command sets provide commands for explicitly flushing the disk caches. The SCSI and ATA command sets also provide an I/O mechanism known as Force Unit Access (FUA). The FUA bit is set on a per-I/O basis. A write that has the FUA bit set tells the disk that this I/O must not be considered complete until it has reached stable media. An I/O configured to use FUA is sometimes referred to as a writethrough, since the write goes through the cache, directly to the physical medium.

For performance reasons, SQL Anywhere does not issue a flush to the transaction log after every commit. It instead relies on the presence of FUA to guarantee that committed transactions are present on disk. If the underlying operating system does not provide a reliable FUA operation and write caching is in use, SQL Anywhere does not guarantee transaction durability.

3.2.1 Windows

On Windows, a disk flush can be requested using the FlushFileBuffers() call. SQL Anywhere uses this call in critical places to ask the OS to guarantee data reliability. Unfortunately, bugs in certain I/O drivers mean that flush commands are not always passed to the disk as a result of this call. SQL Anywhere also requests that I/Os use FUA by opening database and transaction log files with the FILE_FLAG_WRITE_THROUGH flag to CreateFile(). Unfortunately, the handling of FUA is not consistent across all disk manufacturers and versions of Windows. It has been observed that some ATA-based implementations discard the FUA bit entirely, compromising the reliability of SQL Anywhere. Versions of Microsoft Windows prior to Windows 2000 Service Pack 3 (SP3) have been known to inconsistently propagate the FUA bit to the disk, and Windows may require setting a registry key to cause the FUA bit to be propagated. SCSI implementations honor the FUA bit.
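A minimal Win32 sketch of the two mechanisms just described: opening a file for writethrough (FUA) I/O and explicitly flushing the disk cache. The file name, record, and error handling are illustrative; this is not SQL Anywhere's actual code, which, as noted above, normally relies on FUA rather than flushing on every commit.

#include <windows.h>

/* Illustrative only: writethrough I/O plus an explicit cache flush. */
BOOL write_log_record(const void *rec, DWORD len)
{
    /* FILE_FLAG_WRITE_THROUGH asks that each write not be reported complete
     * until it has reached stable media (the FUA mechanism). */
    HANDLE h = CreateFileA("demo.log", GENERIC_WRITE, 0, NULL,
                           OPEN_EXISTING, FILE_FLAG_WRITE_THROUGH, NULL);
    if (h == INVALID_HANDLE_VALUE)
        return FALSE;

    DWORD written = 0;
    BOOL ok = WriteFile(h, rec, len, &written, NULL) && written == len;

    /* FlushFileBuffers asks the OS to flush its buffers and to send a
     * cache-flush command (Synchronize Cache / Flush Cache) to the disk. */
    if (ok)
        ok = FlushFileBuffers(h);

    CloseHandle(h);
    return ok;
}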

3.2.2 Linux

On Linux, disk flushes are requested via the fsync call. Unfortunately, misconfiguration and implementation quirks at the higher levels of the Linux I/O stack can mean that a call to fsync does not result in a flush command being sent to the disk. In most cases, it is necessary to disable the on-disk write cache altogether to prevent file corruption. Linux does not expose a method for user-land processes to request FUA for individual I/Os. As a result, disk write caches must be disabled on Linux to guarantee transaction durability.

3.3 Storage Controllers

Storage controllers are the hardware that maps commands from the operating system into actual operations on one or more disks. Commodity PCs typically have a simple storage controller integrated into the motherboard which allows communication with a handful of individual disks. Server-class machines may have dedicated storage controllers that implement more advanced features such as RAID and virtual drive configuration. Some hardware RAID controllers contain a battery-backed write cache, which allows the controller to guarantee the durability of data in the write cache. If you have such a controller, it is safe to enable its cache so long as the write caches of the individual disks themselves are disabled.

Some commodity PC manufacturers now provide firmware-based RAID controllers which are misleadingly marketed as hardware RAID controllers. These controllers, referred to as FakeRAID or HostRAID, in fact perform only the most basic operations and rely on the operating system to provide a driver that actually implements the RAID operations. The implications of FakeRAID are platform-dependent, as follows.

3.3.1 Windows

Windows provides software RAID built into the OS. The choice of storage controller drivers may still be an issue, as described in the storage driver section for Windows, and it may be necessary to turn off the disk write cache for reliable behavior.

3.3.2 Linux

On Linux, software RAID is handled via the device mapper level (described in section 3.4.2). This layer is known to strip out requests to flush the disk cache in many situations. Devices used for such a configuration under Linux typically have names starting with the md prefix. If you have device names starting with md, you should either reconfigure your system to not use RAID or ensure that write caches are disabled on the underlying disks.

3.4 Storage Drivers

Immediately above the storage controller in the I/O stack lies the storage driver. Storage drivers are specific to the storage controller in use and are responsible for translating between the generic block I/O operations used by the kernel and the native command set (SCSI or ATA) of the controller.

3.4.1 Windows

As discussed previously, on Windows SQL Anywhere opens the database file using the FILE_FLAG_WRITE_THROUGH flag, which means the database server should not be notified of a completed write until the write is actually stored on the physical media. As previously mentioned, the handling of FILE_FLAG_WRITE_THROUGH is inconsistent across different hardware and versions of Windows. The fact that disks continue to buffer writes in spite of this flag means that writes can potentially reach the disk out of order.

Not only are there points in time when writes need to take place in order, but there are also points in time when the file metadata needs to be on stable media. SQL Anywhere calls FlushFileBuffers() to force any outstanding writes and the metadata to disk. FlushFileBuffers() causes a Synchronize Cache (SCSI) or Flush Cache (IDE/ATAPI) command to be sent to the storage device. This is important because any write that occurs afterwards is then guaranteed that all writes prior to the flush request are already on stable media. It also means that the metadata stored on stable media properly reflects the state of the database files at that point in time.

The reason this discussion appears in the storage drivers section rather than the filesystem section is that some storage drivers are known to ignore the Flush Cache request. Perhaps there is an assumption that these storage devices are being used with a battery backup, but this is certainly not always the case. A consequence of the Flush Cache command is that the entire disk cache must be flushed to stable media. This is potentially costly and can affect performance, so drive/driver combinations that ignore it provide increased performance at the cost of recoverability.

Running SQL Anywhere with a storage driver that fails to issue Flush Cache requests is risky. The ACID requirements cannot be met under such circumstances and there is potential for corruption of the database files. It is extremely important to know that your system performs flushes of the disk cache when requested. Contact your hardware and software vendors to make sure that your system is compliant with these requirements.

3.4.2 Linux

Complicating matters slightly is the fact that Linux contains two sets of ATA drivers: the older, stable set is mostly used for old parallel IDE devices, while the newer set, referred to as libata, is mainly used for SATA devices, although most of the old drivers have been rewritten to the new model. The new model maps SCSI commands into ATA commands to simplify the higher levels of the storage stack and make all block devices look like SCSI devices. If you know that your disk is a SATA disk but the device name is /dev/sda (instead of /dev/hda), you are most likely using the new libata-based drivers.

3.4.2.1 Device Mapper

The device mapper layer is used to emulate, in software, features available in high-end hardware storage controllers. As previously mentioned, the device mapper layer is used to support software RAID and the Logical Volume Manager (LVM). Software RAID allows the Linux kernel to present multiple physical disks as a single disk using the algorithms defined for the standard RAID levels, without requiring any special hardware. LVM allows the creation of virtual volumes (for example, disks) and partitions. LVM has some very desirable features, including the ability to add new physical disks to a logical volume (increasing its size) and the ability to resize partitions without destroying data. The convenience of administering a system with LVM has made its use a default choice during installation on a number of popular Linux distributions.

Despite the convenience it provides, the device mapper layer hampers storage reliability. Until very recently, it stripped out any requests from the higher levels in the storage stack to flush disk caches. This has now been addressed for a limited set of configurations involving only single disks in Linux kernels 2.6.29 and higher. Given the traditional uncertainty of the behavior of flush requests through the device mapper layer, it is recommended that LVM and software RAID be avoided on systems running SQL Anywhere.

3.4.2.2 The Block I/O Layer

The block I/O layer consists of five main components:

1. A queue of outstanding I/O requests.
2. A simple, device-independent interface to block-based I/O.
3. A set of I/O schedulers that merge and reorder requests in the queue to maintain user-configurable system responsiveness requirements.
4. A cache of recently used filesystem blocks.
5. A pluggable backend used by storage drivers to map generic block I/O requests to real I/O operations.

The block I/O layer provides a simple interface to read and write blocks from disk. Interestingly, it does not directly expose a method to flush the write cache of a disk drive. Instead, it exposes the more generic operation of an I/O barrier. When an I/O barrier is inserted into the queue of pending I/Os, it guarantees that no operation after the barrier will complete until all operations before the barrier have completed. Notice that a flush can be simulated by enqueuing a barrier operation followed by a write and waiting for the write to complete: since the write cannot complete until all operations before the barrier have completed, once the write completes SQL Anywhere can be sure that all data in any caches has reached stable storage. All current storage drivers implement barrier operations by issuing commands to flush the disk write cache.

Linux provides a number of different I/O schedulers that can be selected by the user at runtime (for example, via /sys/block/<device>/queue/scheduler). Each scheduler implements a different policy: the deadline and anticipatory schedulers aim for maximum I/O throughput, the CFQ scheduler aims to fairly distribute I/O bandwidth between different user applications on the system, and the noop scheduler passes requests directly to the storage drivers as they arrive. The selection of an I/O scheduler does not affect the reliability of the storage stack, and you are encouraged to experiment with the various I/O schedulers and choose the one which provides the best performance for your particular workload.

To improve overall system performance, all I/Os passing through the block I/O layer are cached. This cache, commonly called the Linux page cache, dynamically grows to consume any unused memory and automatically shrinks when memory is required by applications. Linux periodically flushes the page cache with a background task known as pdflush; the cache is also explicitly flushed when an application issues an fsync. Flushing this cache does not involve barrier operations and can always be done reliably.

Since SQL Anywhere already maintains its own cache of database pages, it bypasses the page cache by default by using direct I/O. This reduces competition for memory between SQL Anywhere and the Linux kernel and generally improves performance. If desired, SQL Anywhere can be configured not to use direct I/O by specifying the -u option when starting the database server.
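A minimal sketch of direct I/O on Linux, with a hypothetical path, page size, and offset (not SQL Anywhere's actual implementation). Note that O_DIRECT requires suitably aligned buffers and transfer sizes, and that the open can fail on filesystem configurations that cannot support direct I/O (such as Ext3/Ext4 mounted with data=journaled, discussed below):

#define _GNU_SOURCE            /* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define PAGE_SIZE 4096

/* Illustrative only: bypass the Linux page cache with O_DIRECT, then use
 * fsync so the data reaches stable storage rather than just the disk cache. */
int write_page_direct(const char *path, const void *page, off_t off)
{
    int fd = open(path, O_WRONLY | O_DIRECT);
    if (fd < 0)
        return -1;

    /* O_DIRECT transfers must use an aligned buffer. */
    void *buf = NULL;
    if (posix_memalign(&buf, PAGE_SIZE, PAGE_SIZE) != 0) {
        close(fd);
        return -1;
    }
    memcpy(buf, page, PAGE_SIZE);

    int rc = (pwrite(fd, buf, PAGE_SIZE, off) == PAGE_SIZE) ? 0 : -1;

    /* Direct I/O skips the page cache, but only the fsync (via a barrier)
     * forces the disk's own write cache to be flushed. */
    if (rc == 0 && fsync(fd) != 0)
        rc = -1;

    free(buf);
    close(fd);
    return rc;
}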

3.5 Filesystems

SQL Anywhere uses regular files for its databases. As a result, all operations to and from the database ultimately pass through a filesystem driver. Filesystems manage two types of data: file data and metadata. File data is the actual data written by the application; in SQL Anywhere's case, this is the data contained in the database file. Metadata is the data required by the filesystem to manage files, and includes things like the creation time, file size, permissions, file name, and the set of disk blocks allocated to the file.

3.5.1 Windows

Windows operating systems and SQL Anywhere both support the FAT and NTFS filesystems. NTFS is a journaling filesystem, which uses a mechanism similar to a transaction log in database systems for managing modifications to the disk. NTFS, like database systems, relies on the underlying hardware to properly flush data to the media. As a result of unwanted caching, the filesystem can essentially revert to an older state in the event of a power outage; a more serious consequence is that the filesystem itself can actually become inconsistent and corrupted.

3.5.2 Linux

Linux supports a wide variety of filesystems. This section focuses on the most popular choices: Ext2, Ext3, Ext4 and XFS. Of the four, Ext3, Ext4 and XFS are journalled filesystems, while Ext2 is non-journalled. Journalled filesystems internally implement a logging facility, similar to the one used in database systems, to manage modifications to the on-disk representations of the data structures used. In the event the filesystem is unmounted uncleanly, it is able to bring itself back to a consistent state on the next mount. Ext2, which lacks a journal, is likely to become corrupt after an abnormal unmount and will require repair using the fsck utility.

Ext3 and Ext4 provide three different journalling modes:

data=journaled
All data and metadata changes are written to the journal. This option performs poorly because it requires that all data be written twice. This mode also precludes the use of direct I/O, further degrading performance for SQL Anywhere.

data=ordered
Only metadata changes are journaled, but any pending data changes are flushed to disk before a journal commit occurs. This mode provides the same consistency guarantees as data=journaled, but it performs much better. This is the default journalling mode for Ext3 and Ext4.

data=writeback
Only metadata changes are journaled, and pending data changes are not forced to disk before a journal commit occurs. In the event of an operating system crash or loss of power, operating in data=writeback mode means that new portions of files may contain stale blocks after recovery. This may pose a security risk, but it does not pose a risk to reliability so long as the application properly flushes its data via fsync.

Correct operation of the data=ordered and data=writeback modes depends on the filesystem issuing barrier operations to the block I/O layer. Surprisingly, Ext3 and Ext4 are not configured to use barriers by default, and the option must be explicitly enabled. To enable barrier support for Ext3 and Ext4, add the barrier=1 option to the mount options for your filesystem in /etc/fstab (for example, changing defaults to defaults,barrier=1) and then reboot.

XFS supports only a single journalling mode, roughly equivalent to data=writeback. The use of barriers by XFS is enabled by default, but may be disabled by passing the nobarrier option to the filesystem. If you are using XFS, verify that the nobarrier option is not being used by examining the mount options for your filesystem in /etc/fstab.

SQL Anywhere's reliability requirement is that a call to fsync guarantees that any writes issued before the fsync are on stable storage when the call to fsync returns. If the disk write cache is not in use, Ext2, Ext3, Ext4 and XFS all meet this requirement. If the write cache is in use, the filesystem must issue a barrier operation on an fsync in order to cause the cache to be flushed. Ext2 does not issue barrier operations of any sort and is therefore not safe. XFS, and Ext3 and Ext4 operating in data=ordered or data=writeback mode, have traditionally issued barrier operations on an fsync only if there had also been a change to the metadata for the file; an in-place update of a file causing no metadata changes would not be sufficient to cause these filesystems to issue a barrier on a call to fsync. Starting with kernel 2.6.32 for Ext3 and Ext4, and kernel 2.6.33 for XFS, these filesystems always issue barrier operations on an fsync, provided they are mounted with the options required to support barriers. If you are using an older kernel, or if you do not have barrier support enabled, you cannot enable the write cache of the drive and maintain SQL Anywhere's requirements for reliable storage.

4 System Configuration

Unwanted caching by the disk is an important factor leading to corrupt or inconsistent data when power loss occurs. Disabling the write cache of a disk can alleviate many problems.

4.1 Windows

4.1.1 Disabling the write cache

Open Device Manager. (On Windows 7 and Windows Vista, choose Control Panel > Device Manager. On Windows XP, choose Control Panel > Administrative Tools > Computer Management > Device Manager.) Under Disk Drives, right-click the disk drive and choose Properties, click the Policies tab, and uncheck Enable Write Caching.

4.2 Linux

4.2.1 Determining the device name

The df command can be used to determine the device on which a SQL Anywhere database lives. A SQL Anywhere database often consists of multiple dbspace files and a transaction log file (and perhaps a mirror log file); the following example uses just the main database file, or system dbspace. To determine the device, pass the SQL Anywhere database file as the sole argument to df. The device file representing the partition on which the file lives is given in the first column of the output:

$ df /opt/sqlanywhere11/demo.db
Filesystem     1K-blocks      Used Available Use% Mounted on
/dev/sda1      921150000 394688772 526461228  43% /

In this case the device name reported is /dev/sda1. To determine the name of the real block device, remove any numbers from the suffix of the device name reported. In this example, the device name of the underlying block device is /dev/sda.

The device name gives you some insight into the type of device being used. If the device name starts with /dev/sd, you either have a SCSI disk, or you have a SATA disk and are using the new libata-based driver. If the device name starts with /dev/hd, you have an ATA-based device and are using the old parallel-IDE drivers. If the device name starts with /dev/md, your system is using the device mapper layer to implement a software RAID. If your device name starts with /dev/mapper/, your system is using the device mapper layer to implement LVM.

If you are using a software RAID or LVM, you must determine the real underlying device in order to disable the write cache of the drive. In the case of a software RAID, the underlying devices can be determined by examining the /sys filesystem. Suppose that the df command shows that your database file resides on the device md127. Then /sys/block/md127/md/ contains symlinks with the prefix dev- to the real underlying devices:

$ ls -d /sys/block/md127/md/dev-sd*
/sys/block/md127/md/dev-sda  /sys/block/md127/md/dev-sdb

When LVM is being used, the pvs tool can be used to enumerate all the physical volumes on the machine and the volume group to which each is allocated:

$ df /opt/sqlanywhere11/demo.db
Filesystem                       1K-blocks     Used Available Use% Mounted on
/dev/mapper/volgroup00-logvol00   28376956 14314528  12597700  54% /

$ pvs
  PV         VG         Fmt  Attr PSize   PFree
  /dev/sda2  VolGroup00 lvm2 a-    29.88G    0
  /dev/sdb1  HomeGroup  lvm2 a-    10.00G    0
  /dev/sdb2  HomeGroup  lvm2 a-   989.99G    0

In this case, the database file lives on the logical volume /dev/mapper/volgroup00-logvol00, and by following the output of pvs you can determine that it lives on the physical device /dev/sda.

4.2.2 Disabling the write cache

To disable the write cache, first determine the underlying physical device on which the SQL Anywhere database lives using the techniques given in the previous section. The tool used to disable the write cache depends on the type of device. If the device name starts with /dev/sd, use the sdparm tool:

$ sdparm --clear=wce /dev/sda

To make the change persistent across reboots, use:

$ sdparm --clear=wce --save /dev/sda

If the device name starts with /dev/hd, use the hdparm tool:

$ hdparm -W 0 /dev/hda

Unlike sdparm, the hdparm command does not provide a mechanism for persisting the change across reboots. To make this change persistent, create an init script according to the documentation of your distribution and include the hdparm command given above.

5 Conclusions

SQL Anywhere has a single, modest requirement of the storage stack provided by the operating system: the flush call must guarantee that all writes issued before the call are on stable storage once the call completes. When you use a disk with the write cache enabled, a number of conditions can make the storage stack unreliable.

5.1 Windows

If either of the conditions below exists, disabling the write cache as explained earlier will make the system more recoverable:

- Failure to respect the FUA bit
- Failure to respect a Flush Cache request

5.2 Linux

If your system is configured such that any of the conditions below are true, you must disable the write cache on your disk as described earlier in this whitepaper. Only if all of these conditions are false may you safely enable the write cache on your disk:

- Using the Ext2 filesystem
- Using the Ext3 or Ext4 filesystem without the barrier=1 mount option
- Using the Ext3 or Ext4 filesystem with a Linux kernel older than 2.6.32
- Using the XFS filesystem with the nobarrier mount option
- Using the XFS filesystem with a Linux kernel older than 2.6.33
- Using LVM, software RAID, FakeRAID, or HostRAID with a kernel older than 2.6.29