We look into some of the new features/additions/changes in the 2.6.38 kernel. In a nutshell: think performance enhancements, additional capability, and additional management options.
Who Doesn’t Like Performance?
Kernel development has many aspects: performance, stability, transparency, modularity, and so on. Each of these is addressed at one time or another as the kernel evolves. However, there is a group of us that is more performance oriented than others. Sometimes we are referred to as “performance junkies” or, as I like to think of it, “performance challenged”. Regardless of the label, we like to see more storage performance from Linux, particularly from the kernel. The 2.6.38 kernel introduced some changes that helped performance, making all of us performance challenged people very happy.
For some time there has been an effort to improve the scalability of the VFS (Virtual File System). Remember that the VFS lies between the system calls, where all of your I/O calls originate, and the file systems. With single servers having up to 48 cores (soon, 64 cores), and many of the cores possibly doing I/O, allowing the VFS to scale, particularly in terms of performance, is critical. So Nick Piggin dove in and started working on scalability patches to the VFS.
The VFS is not a place you want to dive into without considerable fortitude. There are many aspects you have to consider when writing VFS code. Nick’s patches were posted to the development mailing lists and some disagreement arose. Dave Chinner from Red Hat, Al Viro, and others dove in to examine, critique, improve, or perhaps replace Nick’s patches. This work can be very complex and tedious, so I would like to thank all of the developers who worked on the patches. It was a tough road, but having so many really good kernel developers work on the VFS will always result in a better VFS for everyone.
Some of the fruits of the first set of VFS scalability patches were reaped in 2.6.36. These laid the groundwork for later patches, including some good ones in 2.6.38, the most recent kernel. In 2.6.38, the dcache (directory cache) and the lookup mechanisms have been redesigned and recoded to be more scalable. The details are very complicated, but LWN has an article that explains them. Overall, the goal of the patches was to make various aspects of the VFS more scalable for multi-threaded workloads.
The impact of the patches is wonderful for multi-threaded workloads but, curiously and fortuitously, they also help single-threaded workloads. In some testing, a simple “find . -size” on a home directory was 35% faster with these patches (I believe this was for Linus’ home directory). A single-threaded git diff on a cached kernel tree ran 20% faster. Even better, when 64 parallel git diffs were executed, throughput increased by a factor of 26.
While the patches are still a bit controversial, and may require some further development, having them in the kernel will accelerate their review. Plus it gives us “performance challenged” people something to cheer about.
File System Improvements
In addition to the VFS patches, there were a number of file system improvements in the 2.6.38 kernel.
There were a couple of good additions to btrfs. The first was LZO compression. LZO is fairly fast and is already built into the kernel, so it’s a very logical place for file systems to go for compression that combines good performance with good compression ratios.
The second feature added to btrfs is read-only snapshots. As any good storage administrator knows, the ability to take a snapshot and mark it read-only is the first step in a backup process. You take the read-only snapshot so that it can’t be changed while you’re doing the backup. The amount of space the snapshot takes is fairly low, and once the backup is done it can be easily discarded. This is a very nice addition for btrfs and points to some more mainstream tools making their way into btrfs.
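To make that backup flow concrete, here is a minimal sketch of the three steps (snapshot, back up, discard) as a list of commands. It assumes the `btrfs` command-line tool with its `subvolume snapshot -r` and `subvolume delete` subcommands and uses `tar` as a stand-in for whatever backup tool you prefer. The function names and every path in the usage example are hypothetical.

```python
import subprocess


def backup_plan(subvolume, snapshot, archive):
    """Return the commands for a snapshot-based backup, in order."""
    return [
        # 1. Freeze a read-only view of the subvolume.
        ["btrfs", "subvolume", "snapshot", "-r", subvolume, snapshot],
        # 2. Back up the frozen view (tar is only a placeholder tool).
        ["tar", "-cf", archive, "-C", snapshot, "."],
        # 3. Discard the snapshot once the backup is done.
        ["btrfs", "subvolume", "delete", snapshot],
    ]


def run_plan(plan, dry_run=True):
    """Print the commands, or execute them when dry_run is False."""
    for cmd in plan:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops at the first failing step.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical paths; a real run needs root and a btrfs file system.
    run_plan(backup_plan("/home", "/home/.backup-snap", "/backup/home.tar"))
```

The dry-run default is deliberate: it lets you inspect the commands before pointing them at a real file system. Note that if the backup step fails, the snapshot is left in place for you to inspect or remove by hand.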
XFS has seen a great many additions in the 2.6.3x series. In 2.6.38 three major new additions were made:
- Manual SSD discard support was added via the FITRIM ioctl. This patch allows the FITRIM I/O control to be invoked manually, forcing the TRIM features in SSDs to be used. It isn’t intended to be run while the file system is performing I/O, since this ioctl can cause performance degradation.
- Inode cache lookups were converted to use RCU locking. This change was made because there was a great deal of read vs. write contention when inode reclaim runs at the same time as lookups. The change greatly reduces this contention, improving overall inode performance (basically metadata performance).
- Dynamic speculative EOF preallocation was added. This may sound complicated, and the implementation is, but in a nutshell it reduces file system fragmentation by adjusting how much space is speculatively allocated beyond the end of a file during what is called delayed allocation (delayed allocation can help performance).
Lots of cool things happening in the XFS world.
Despite their age, both ext2 and ext3 are getting some additional capability because they are still widely used. In the 2.6.38 kernel, two changes were made. The first is really a pair of optimizations to some functions that resulted in speed improvements. One patch improves throughput performance (about 14% on a Bonnie++ test) and the other improves metadata performance by about 6% on a Bonnie++ test.
The second change added batched discard support (SSD TRIM support) to ext2 and ext3.
While not a big change, in the 2.6.38 kernel NILFS2 added support for the fiemap ioctl. This ioctl (I/O Control) is used to get extent information for an inode. It is an efficient method for userspace to get file extent mappings.
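Here is a small sketch of how userspace can ask for extent mappings with this ioctl. It assumes the `struct fiemap` and `struct fiemap_extent` layouts from the kernel's `linux/fiemap.h` header and the generic ioctl number encoding used on x86 and ARM. The function name `file_extents` and the `max_extents` parameter are my own, and the call only works on file systems that implement fiemap.

```python
import fcntl
import struct

# struct fiemap header: fm_start, fm_length, fm_flags,
# fm_mapped_extents, fm_extent_count, fm_reserved
FIEMAP_HDR = struct.Struct("=QQIIII")
# struct fiemap_extent: fe_logical, fe_physical, fe_length,
# fe_reserved64[2], fe_flags, fe_reserved[3]
FIEMAP_EXTENT = struct.Struct("=QQQQQIIII")

# _IOWR('f', 11, struct fiemap) in the generic Linux ioctl encoding.
FS_IOC_FIEMAP = (3 << 30) | (FIEMAP_HDR.size << 16) | (ord("f") << 8) | 11
FIEMAP_FLAG_SYNC = 0x1  # flush dirty data before mapping


def file_extents(path, max_extents=32):
    """Return up to max_extents (logical, physical, length, flags) tuples."""
    buf = bytearray(FIEMAP_HDR.size + max_extents * FIEMAP_EXTENT.size)
    # Map the whole file: start at 0, length of "everything".
    FIEMAP_HDR.pack_into(buf, 0, 0, 2**64 - 1, FIEMAP_FLAG_SYNC, 0, max_extents, 0)
    with open(path, "rb") as f:
        fcntl.ioctl(f.fileno(), FS_IOC_FIEMAP, buf)
    mapped = FIEMAP_HDR.unpack_from(buf, 0)[3]  # fm_mapped_extents
    extents = []
    for i in range(mapped):
        offset = FIEMAP_HDR.size + i * FIEMAP_EXTENT.size
        fields = FIEMAP_EXTENT.unpack_from(buf, offset)
        extents.append((fields[0], fields[1], fields[2], fields[5]))
    return extents


# Usage (the path is only an example):
#     for logical, physical, length, flags in file_extents("/var/log/syslog"):
#         print(logical, physical, length, hex(flags))
```

One call returns the whole extent list, which is why this is more efficient than probing a file block by block.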
One of my favorite file systems, SquashFS, added a new compression algorithm to its arsenal. Remember that SquashFS takes a subtree of a file system (or an entire file system) and creates a compressed image of it that can be mounted read-only. This can save a great deal of space for data that is only ever read, such as rarely touched data. In the 2.6.38 kernel, SquashFS added XZ compression. XZ uses the LZMA2 algorithm for lossless data compression. XZ usually achieves a higher compression ratio than bzip2 which, in the case of SquashFS, means that you can save more space.
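If you want a feel for the XZ versus bzip2 comparison on your own data, a quick sketch with Python's standard `lzma` and `bz2` modules will do. This is only indicative: SquashFS compresses data in blocks inside the kernel, so the ratios it achieves will differ from whole-buffer numbers like these. The function name `compression_ratios` and the sample file in the usage comment are my own.

```python
import bz2
import lzma


def compression_ratios(data):
    """Return original size / compressed size for XZ and bzip2."""
    xz_blob = lzma.compress(data, format=lzma.FORMAT_XZ)  # .xz container, LZMA2
    bz_blob = bz2.compress(data)
    # Sanity check: both codecs are lossless.
    if lzma.decompress(xz_blob) != data or bz2.decompress(bz_blob) != data:
        raise ValueError("round trip failed")
    return {
        "xz": len(data) / len(xz_blob),
        "bzip2": len(data) / len(bz_blob),
    }


# Usage (the path is only an example):
#     with open("/usr/share/dict/words", "rb") as f:
#         print(compression_ratios(f.read()))
```

Which codec wins, and by how much, depends on the data, so it is worth trying a representative sample before settling on one.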
There were a number of improvements to the general block devices in the 2.6.38 kernel. Some of them are particularly useful. For example, in the 2.6.37 kernel the ability to throttle I/O was added. In the 2.6.38 kernel, this capability was enhanced by allowing the creation of hierarchical cgroups in the block cgroup controller. This means that you can create a hierarchical map that limits the I/O of certain “groups” and then further subdivide that I/O capability into sub-groups.
In the 2.6.38 kernel, we also saw a number of changes and additions to the Device Mapper (DM). Without going through all of them, here are the major highlights.
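As a sketch of what a group and sub-group might look like in practice: block cgroups are plain directories, and a throttle limit is a line of the form `major:minor bytes_per_second` written to the `blkio.throttle.read_bps_device` file in the group's directory. The helper names, the group names, and the 8:0 device number below are assumptions for illustration, and the demo writes into a scratch directory rather than a real cgroup mount so that it can run without root. The sketch makes no claim about how a given kernel accounts for nested groups.

```python
import os
import tempfile

THROTTLE_FILE = "blkio.throttle.read_bps_device"


def throttle_rule(major, minor, bytes_per_sec):
    """Format a throttle rule as 'major:minor bytes_per_second'."""
    return f"{major}:{minor} {bytes_per_sec}"


def create_group(root, *names):
    """Create a (possibly nested) group directory under root."""
    path = os.path.join(root, *names)
    os.makedirs(path, exist_ok=True)
    return path


def limit_read_bps(group_dir, major, minor, bytes_per_sec):
    """Write a read-bandwidth limit for one device into a group."""
    rule = throttle_rule(major, minor, bytes_per_sec)
    with open(os.path.join(group_dir, THROTTLE_FILE), "w") as f:
        f.write(rule)
    return rule


def demo(root):
    """Build a 'batch' group with a 'backups' sub-group under root."""
    batch = create_group(root, "batch")
    backups = create_group(root, "batch", "backups")
    limit_read_bps(batch, 8, 0, 10 * 1024 * 1024)   # 10 MB/s, assumed device 8:0
    limit_read_bps(backups, 8, 0, 2 * 1024 * 1024)  # 2 MB/s for the sub-group
    return batch, backups


if __name__ == "__main__":
    # On a real system, root would be the blkio cgroup mount point.
    with tempfile.TemporaryDirectory() as scratch:
        print(demo(scratch))
```

On a live system you would then place processes into a group by writing their PIDs to that group's `tasks` file.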
- This patch improves the write throughput performance when writing to the origin with a snapshot on the same device. According to the patch, it looks like the performance was improved about 50% for the record sizes tested.
- This patch improves overall sequential write throughput performance about 20-25% for larger record sizes. The patch collects requests to the device mapper and sends them in a batch which allows the I/O queue to merge consecutive requests and send them to the device all at once, improving performance.
- If you are using the device mapper for encryption (dm-crypt), then this patch might be of interest to you. It allows dm-crypt to scale to multiple CPUs by changing the crypto workqueue to be on a “per-CPU” basis. This also improves performance since the workload is spread across multiple CPUs.
- The device mapper supports RAID-1 and in this patch support for discards was added. This means that trim is now supported in RAID-1 when using the device mapper (DM).
- There are really two device tools in the kernel: the Device Mapper (DM) and Multi-Device (MD). The kernel developers have been working to slowly bring these two capabilities closer together, with the possibility of merging them in the future. In the 2.6.38 kernel, this patch is the skeleton for the DM target that will be the bridge from DM to MD. Initially this is for RAID-4, RAID-5, and RAID-6 (RAID456), with RAID-1 to follow later. Basically this patch is a way to use device-mapper interfaces to the MD RAID456 drivers.
As you can see, there is a great deal of work going on in the device mapper, bringing more performance (thank you on behalf of us performance challenged people) and more capability.
The 2.6.38 kernel is a very good one for storage people. We’re seeing performance improvements in a number of places including the VFS, the device mapper, ext2/3, and XFS. We’re seeing more support for SSDs through the addition of discard (TRIM) capability in file systems (2.6.38 added this capability to ext2/3). Also, compression is becoming more important for file systems that can use it, such as btrfs and SquashFS.
And finally, on one of my pet topics, we’re getting more capability to administer and control our storage systems. In the 2.6.38 kernel, the cgroup blkio capability was enhanced to allow hierarchical groups to be created, giving us more control over assigning I/O capabilities to processes on the system.
The 2.6.38 kernel was a very nice one for us storage-oriented Linux people. Let’s keep an eye on 2.6.39 and 2.6.40 for more development because I think we’ll see some more changes in the VFS.