Two kernel releases have gone by since we last checked in with the check-ins. While the storage-related changes seem minimal, it's always good to review what changed; you might be surprised.
Recently, I spent some time reviewing Linux kernels from a storage perspective, starting with 2.6.30 (and actually reaching back into the 2.6.2x series) and ending with the 2.6.34 kernel. That article was published on June 8 of this year – just a short time ago. In the meantime, the pace of kernel development has continued, and as of this writing the merge window for the 2.6.37 kernel has just closed, signaling the beginning of the great bug hunt before its release. So now seems like a good time to go over the 2.6.35 and 2.6.36 kernels and review the changes that affect the Linux storage world.
One of the best sources of information about Linux kernel versions is kernelnewbies.org. It has very nice writeups of the higher-level kernel changes; they can be a little terse, but they are very useful as a starting point. In particular, let's use the 2.6.35 highlights as the starting point for a discussion of the storage changes in that kernel.
One of the first items listed in the Kernel Newbies 2.6.35 review is a change to one of the favorite Linux file systems, btrfs. In 2.6.35, btrfs added support for direct I/O. Direct I/O is a method that allows data to go from the application's buffers directly to the storage devices (hence the term "direct"). This bypasses the caching and other I/O techniques in the VFS (e.g. read-ahead, write coalescing). For some workloads, direct I/O can result in improved performance and reduced CPU usage on the host. The classic example of a workload that can benefit from direct I/O is a database.
The second updated file system that was highlighted is xfs. XFS has been around a fairly long time and was originally ported to Linux by SGI (the "old" SGI). For many years xfs was the file system of choice for high performance, but it is also known for less-than-stellar metadata performance. However, thanks to Red Hat and other developers, xfs development has been revived, and there are a number of new features and capabilities in today's xfs (i.e. "it's not your father's xfs").
In the 2.6.35 kernel, a new journaling technique was added to xfs that is loosely based on the approach used by ext3 and reiserfs. It accumulates asynchronous transactions in memory before they are committed, which reduces the I/O bandwidth consumed by the log by several orders of magnitude and allows much better performance on metadata-heavy workloads. The patch didn't change the on-disk log format, only the in-memory portion of the logging. This journaling mode, called "delayed logging", is still experimental, but you can test it using a mount option: "-o delaylog". If you are using xfs and have heavy metadata workloads, you might want to consider trying this logging option in a test environment.
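If you want to experiment, the mount invocation looks something like the following; the device and mount point are placeholders, so substitute your own test setup:

```
# mount -t xfs -o delaylog /dev/sdb1 /mnt/test
```

Again, the feature is experimental in 2.6.35, so keep it on scratch data you can afford to lose.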
The kernel newbies review listed the btrfs and xfs changes as fairly major, but there were important changes to other file systems as well. To make things easier, I've created a list of the file systems and their significant changes.
- Squashfs: Support for xattrs (extended file attributes) was added. This allows squashfs to retain the xattr values associated with data, and it also allows squashfs to be better combined ("unioned") with other file systems so that a single data tree can seemingly be both readable and writable.
- Ext2: The remaining pieces of the BKL (Big Kernel Lock) were removed from ext2 in 2.6.35. The BKL was originally used to get SMP (Symmetric Multi-Processing) capability into the kernel fairly quickly, but it limited the scalability of the kernel and imposed some performance penalties. While you may scoff at people still focusing on ext2, it is still widely used and, believe it or not, it has good performance.
- Ext4: The big change with ext4 in 2.6.35 was the addition of a check for a good block group before loading “buddy pages”. This change speeds up allocations particularly for the case where partitions are relatively full.
- NILFS: This log-structured file system made a subtle change in 2.6.35: the default of the "errors" mount option is now "remount-ro". Previously, when nilfs encountered an error it would allow operations to continue, possibly causing further errors. With this change, the default behavior is to remount the file system read-only ("ro") so that further damage is minimized.
- OCFS2: In 2.6.35 there were several changes to OCFS2 (one of the underappreciated file systems in Linux). The first change was the implementation of allocation reservations, which can greatly reduce fragmentation because the allocations for a specific file can be done much more sequentially (this has implications for performance as well). Several smaller changes were made as well; they are listed below (from the kernel newbies list):
- The “punch-hole” code was further optimized to speed up some rare operations.
- Discontiguous block groups were added to improve some types of allocations. This sets an incompatible feature bit, so file systems that use it cannot be mounted by older kernels.
- The "nointr" (no file-operation interruption) mount option was made the default.
There were also some additions to the block I/O layer of the kernel (blkio) that impact storage. These changes can be important because they affect how I/O operations are done within the kernel, including the monitoring of I/O operations. The first set of changes is the addition of some statistics that can help with monitoring. These statistics are:
- blkio.io_service_bytes: the number of bytes transferred to or from the disk by the cgroup
- blkio.io_serviced: the number of I/O operations serviced by the disk for the cgroup
- blkio.io_service_time: the total time between request dispatch and request completion
- blkio.io_wait_time: the total time requests spent waiting in the scheduler queues
- blkio.io_merged: the number of bios/requests merged into requests belonging to the cgroup
- blkio.io_queued: the number of requests queued up at any given instant
- blkio.avg_queue_size: the average queue depth over the lifetime of the cgroup
The first four statistics are accumulated per operation type, distinguishing read from write operations and sync from async I/O operations. The fifth statistic, io_merged, is mostly used for debugging. The last two statistics are also primarily used for debugging; they track the queue depth of the cgroup (a part of the I/O controller). The authors, Jens Axboe and Divyesh Shah, state that they can be used for debugging user I/O problems.
A few more patches were added to gather the per-cgroup statistics, which can also be used for debugging problems. One of the more important patches added a "reset_stats" interface so that all of these statistics can be reset (great for debugging).
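Roughly, reading and resetting these statistics looks like the following, assuming the blkio controller is mounted at /cgroup and a cgroup named mygroup already exists (both names are placeholders, and root privileges are required):

```
# mount -t cgroup -o blkio none /cgroup
# cat /cgroup/mygroup/blkio.io_serviced
# echo 1 > /cgroup/mygroup/blkio.reset_stats
```

The reset interface is what makes before/after comparisons practical when chasing an I/O problem.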
Overall, these kernel changes may not result in obvious changes to Linux storage, but they are actually fairly important. We're seeing the continued development of btrfs, albeit somewhat slower than we'd like. This could point to the stabilization of btrfs, but people are still asking for additional features. Similarly, we're seeing changes to ext4 that capture more "corner cases", which can affect performance.
Finally, one change that is important, whether or not you think so, is the addition of the statistics to the I/O controller. These statistics can greatly help developers when debugging problems or when users encounter problems. Scheduling I/O operations can be very complex and depends greatly on the details of the workload, so it's almost impossible to know a priori how workloads will interact with the I/O scheduler. Being able to understand what the I/O scheduler is actually doing is important for finding and debugging problems and for improving performance (remember that bad performance is considered by many to be a "bug").