Storage Highlights in 2.6.38

We look into some of the new features/additions/changes in the 2.6.38 kernel. In a nutshell: think performance enhancements, additional capability, and additional management options.

Who Doesn’t Like Performance?

Kernel development has lots of aspects: performance, stability, transparency, modularity, etc. Each of these aspects is addressed at one time or another while the kernel evolves. However, there is a group of us that is more performance oriented than others. Sometimes we are referred to as “performance junkies” or what I like to think of as “performance challenged”, but regardless of our label, we like to see more storage performance from Linux, particularly the kernel. The 2.6.38 kernel introduced some changes that helped performance, making all of us performance challenged people very happy.

For some time there has been an effort to improve the scalability of the VFS. Remember that the VFS lies between system calls, where all of your I/O calls originate, and the file systems. With single servers having up to 48 cores (soon, 64 cores), and many of the cores possibly doing I/O, allowing the VFS to scale, particularly in terms of performance, is critical. So Nick Piggin dove in and started working on scalability patches to the VFS.

The VFS is not a place you want to dive into without considerable fortitude. There are many aspects you have to consider when writing VFS code. Nick’s patches were posted to the development mailing lists and some disagreement arose. Dave Chinner from Red Hat, Al Viro, and others dove in to examine, critique, perhaps replace, or improve Nick’s patches. This work can be very complex and tedious, so I would like to thank all of the developers who worked on the patches. It was a tough road, but having so many really good kernel developers work on the VFS will always result in a better VFS for everyone.

Some of the fruits of the first set of VFS scalability patches were reaped in 2.6.36. These laid the groundwork for later patches, including some good ones in 2.6.38, the most recent kernel. In 2.6.38, the dcache (directory cache) and the path lookup mechanisms have been redesigned and recoded to be more scalable. The details are very complicated, but LWN has an article that explains them. Overall, the goal of the patches was to make various aspects of the VFS more scalable for multi-threaded workloads.

The impact of the patches is wonderful for multi-threaded workloads, but curiously and fortuitously, single-threaded workloads benefit as well. In some testing, a simple “find . -size” on a home directory was 35% faster with these patches (I believe this was for Linus’ home directory). A single-threaded git diff on a cached kernel tree ran 20% faster. And even better, when 64 parallel git diffs were executed, throughput increased by a factor of 26.

While the patches are still a bit controversial, and may require some further development, having them in the kernel will accelerate their review. Plus it gives us “performance challenged” people something to cheer about.

File System Improvements

In addition to the VFS patches, there were a number of file system improvements in the 2.6.38 kernel.

There were a couple of good additions to btrfs. The first is LZO compression. LZO is fairly fast and is already built into the kernel, so it’s a very logical choice for a file system that wants compression with both good performance and good compression.

The second feature added to btrfs is read-only snapshots. As any good storage administrator knows, being able to take a snapshot and mark it read-only is the first step in a backup process. You take the read-only snapshot so that it can’t be changed while you’re doing the backup. Because btrfs snapshots are copy-on-write, the snapshot takes up little extra space, and once the backup is done it can be easily discarded. This is a very nice addition for btrfs and points to some more mainstream tools making their way into btrfs.
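As a rough illustration of that backup workflow, here is a minimal Python sketch that takes a read-only snapshot, archives it with tar, and then discards the snapshot. It assumes a btrfs-progs build whose `btrfs subvolume snapshot` accepts the `-r` (read-only) flag. The function names and the paths in the usage note are hypothetical, and using tar as the backup tool is just an assumption for the example.

```python
import subprocess


def snapshot_cmd(subvolume, snapshot_path):
    # "-r" asks btrfs to create the snapshot read-only.
    return ["btrfs", "subvolume", "snapshot", "-r", subvolume, snapshot_path]


def delete_cmd(snapshot_path):
    return ["btrfs", "subvolume", "delete", snapshot_path]


def backup_via_snapshot(subvolume, snapshot_path, archive_path,
                        run=subprocess.check_call):
    """Snapshot, archive the frozen view, then drop the snapshot.

    `run` is injectable so the sequence can be dry-run without root.
    """
    run(snapshot_cmd(subvolume, snapshot_path))
    try:
        # Archive the snapshot, not the live subvolume, so the data
        # cannot change underneath the backup.
        run(["tar", "-C", snapshot_path, "-cf", archive_path, "."])
    finally:
        # Always discard the snapshot, even if the backup failed.
        run(delete_cmd(snapshot_path))
```

Something like `backup_via_snapshot("/home", "/home/.backup-snap", "/backup/home.tar")` would need root and a real btrfs subvolume; passing `run=print` instead just shows the commands that would be executed.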

XFS has seen a great many additions in the 2.6.3x series. In 2.6.38 three major additions were made:

  1. Manual SSD discard support was added via the FITRIM ioctl. The FITRIM I/O control can be invoked manually to make the file system issue TRIM (discard) requests to the SSD. It isn’t intended to be run while the file system is busy with I/O, since the ioctl can cause performance degradation.
  2. The inode cache lookups were converted to use RCU locking. This change was made because there was a great deal of read vs. write contention when inode reclaim runs at the same time as lookups. The change greatly reduces this contention, improving overall inode performance (basically metadata performance).
  3. Dynamic speculative EOF preallocation was added. This may sound complicated, and the implementation is, but in a nutshell it reduces file system fragmentation by making a better guess at how much space to allocate ahead of time (speculative allocation) during what is called delayed allocation (delayed allocation can help performance).
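To make the first item more concrete, here is a minimal Python sketch of how a userspace program might invoke FITRIM on a mounted file system. The ioctl number and the layout of struct fstrim_range (start, len, minlen, all 64-bit) come from <linux/fs.h>; the numeric constant assumes the common Linux ioctl encoding used on x86 and ARM, and the function names are hypothetical. Treat it as an illustration rather than a replacement for a proper tool.

```python
import fcntl
import os
import struct

# FITRIM is _IOWR('X', 121, struct fstrim_range) in <linux/fs.h>.
# Assumption: the common ioctl number encoding (x86, ARM, ...).
FITRIM = 0xC0185879

# struct fstrim_range { __u64 start; __u64 len; __u64 minlen; };
FSTRIM_RANGE = struct.Struct("=QQQ")


def pack_trim_range(start=0, length=2**64 - 1, minlen=0):
    """Build the ioctl argument; the defaults cover the whole file system."""
    return FSTRIM_RANGE.pack(start, length, minlen)


def trim_filesystem(mount_point, minlen=0):
    """Ask the file system at mount_point to discard its free space.

    Needs root and a file system/device that supports discard.
    Returns the number of bytes the kernel reports as trimmed.
    """
    buf = bytearray(pack_trim_range(minlen=minlen))
    fd = os.open(mount_point, os.O_RDONLY)
    try:
        # The kernel writes the trimmed byte count back into `len`.
        fcntl.ioctl(fd, FITRIM, buf)
    finally:
        os.close(fd)
    _start, trimmed, _minlen = FSTRIM_RANGE.unpack(buf)
    return trimmed
```

Calling `trim_filesystem("/mnt/xfs")` (a hypothetical mount point) on an idle file system is the kind of manual, batched discard the item above describes; it fails with an OSError if the file system or device does not support it.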

Lots of cool things happening in the XFS world.

Despite their age, both ext2 and ext3 are getting some additional capability because they are still widely used. In the 2.6.38 kernel, two new features were added. The first is really a pair of optimizations to some functions that resulted in speed improvements. One patch improves throughput performance (about 14% on a Bonnie++ test) and the other improves metadata performance by about 6% on a Bonnie++ test.

The second feature is batched discard support (SSD TRIM support) for ext2 and ext3.

While not a big change, NILFS2 added support for the fiemap ioctl in the 2.6.38 kernel. This ioctl (I/O control) is used to get extent information for an inode. It is an efficient method for userspace to get file extent mappings.
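For readers who have not used fiemap, the following Python sketch shows roughly what a userspace caller does: it fills in a struct fiemap header, calls the FS_IOC_FIEMAP ioctl, and decodes the returned struct fiemap_extent records. The struct layouts and FIEMAP_FLAG_SYNC come from <linux/fiemap.h>; the ioctl number assumes the common Linux ioctl encoding (x86, ARM), and the helper names are hypothetical.

```python
import fcntl
import struct

# FS_IOC_FIEMAP is _IOWR('f', 11, struct fiemap) in <linux/fs.h>.
# Assumption: the common ioctl number encoding (x86, ARM, ...).
FS_IOC_FIEMAP = 0xC020660B
FIEMAP_FLAG_SYNC = 0x1  # flush the file before mapping

# struct fiemap header: fm_start, fm_length, fm_flags,
# fm_mapped_extents, fm_extent_count, fm_reserved
HEADER = struct.Struct("=QQIIII")
# struct fiemap_extent: fe_logical, fe_physical, fe_length,
# fe_reserved64[2], fe_flags, fe_reserved[3]
EXTENT = struct.Struct("=QQQQQIIII")


def parse_fiemap(buf):
    """Decode a buffer the kernel filled in into a list of
    (logical_offset, physical_offset, length, flags) tuples."""
    mapped = HEADER.unpack_from(buf, 0)[3]
    extents = []
    for i in range(mapped):
        fields = EXTENT.unpack_from(buf, HEADER.size + i * EXTENT.size)
        extents.append((fields[0], fields[1], fields[2], fields[5]))
    return extents


def file_extents(path, max_extents=32):
    """Return up to max_extents extent mappings for the file at path."""
    buf = bytearray(HEADER.size + max_extents * EXTENT.size)
    # Map the whole file: start 0, length = maximum offset.
    HEADER.pack_into(buf, 0, 0, 2**64 - 1, FIEMAP_FLAG_SYNC, 0, max_extents, 0)
    with open(path, "rb") as f:
        fcntl.ioctl(f.fileno(), FS_IOC_FIEMAP, buf)
    return parse_fiemap(buf)
```

On a file system that implements fiemap, `file_extents("/path/to/file")` returns one tuple per extent; a heavily fragmented file shows up as a long list of short extents.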

One of my favorite file systems, SquashFS, added a new compression algorithm to its arsenal. Remember that SquashFS takes a subtree of a file system (or an entire file system) and creates a compressed image of it that can be mounted read-only. This can save a great deal of space for data that is only ever read, such as rarely touched data. In the 2.6.38 kernel, SquashFS added XZ compression. XZ uses the LZMA2 algorithm for lossless data compression. XZ usually achieves a higher compression ratio than bzip2 which, in the case of SquashFS, means that you can save more space.
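If you want a feel for the XZ versus bzip2 difference on your own data, a small Python sketch like the one below compares the two using the standard library’s lzma and bz2 modules. It compresses a whole buffer in one go, whereas SquashFS compresses data in blocks, so the numbers are only indicative; the function name and the sample file name are hypothetical.

```python
import bz2
import lzma


def compressed_sizes(data):
    """Compress the same bytes with XZ (LZMA2) and bzip2 and report sizes."""
    xz = lzma.compress(data, format=lzma.FORMAT_XZ)
    bz = bz2.compress(data)
    # Both formats are lossless, so a round trip must return the input.
    assert lzma.decompress(xz) == data
    assert bz2.decompress(bz) == data
    return {"raw": len(data), "xz": len(xz), "bzip2": len(bz)}
```

For example, `compressed_sizes(open("sample.tar", "rb").read())` returns the raw, XZ, and bzip2 sizes side by side. Which one wins, and by how much, depends on the data.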

Block Improvements

There were a number of improvements to the general block layer in the 2.6.38 kernel. Some of them are particularly useful. For example, in the 2.6.37 kernel the ability to throttle I/O was added. In the 2.6.38 kernel, this capability was enhanced by allowing the creation of hierarchical cgroups in the block cgroup controller. This means that you can create a hierarchical map that limits the I/O of certain “groups” and then you can further subdivide that I/O capability into sub-groups.
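As a sketch of what that looks like from the administrator’s side, the Python below creates a nested group under a blkio cgroup mount and writes a read bandwidth limit into it. The file name blkio.throttle.read_bps_device and the “major:minor bytes_per_second” rule format are the blkio controller’s throttling interface; the mount point, the group names, and the helper functions are assumptions for illustration, and how limits on parent and child groups interact depends on the kernel version.

```python
import os


def throttle_rule(major, minor, bytes_per_sec):
    """Format a blkio throttle rule: '<major>:<minor> <bytes per second>'."""
    return f"{major}:{minor} {bytes_per_sec}"


def create_group(blkio_root, *path_parts):
    """Create a (possibly nested) cgroup directory under the blkio mount."""
    path = os.path.join(blkio_root, *path_parts)
    os.makedirs(path, exist_ok=True)
    return path


def limit_read_bps(group_path, major, minor, bytes_per_sec):
    """Cap read bandwidth from one block device for tasks in the group."""
    rule_file = os.path.join(group_path, "blkio.throttle.read_bps_device")
    with open(rule_file, "w") as f:
        f.write(throttle_rule(major, minor, bytes_per_sec))
```

With the controller mounted at, say, /cgroup/blkio (an assumed location), `create_group("/cgroup/blkio", "backups", "nightly")` followed by `limit_read_bps(path, 8, 0, 10 * 1024 * 1024)` would cap reads from device 8:0 at 10 MB/s for tasks placed in the hypothetical backups/nightly group.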

In the 2.6.38 kernel, we also saw a number of changes and additions to the Device Mapper (DM). Without going through all of them, here are the major highlights.

  • One patch improves write throughput when writing to the origin with a snapshot on the same device. According to the patch, it looks like performance improved by about 50% for the record sizes tested.
  • Another patch improves overall sequential write throughput by about 20-25% for larger record sizes. It collects requests to the device mapper and sends them in a batch, which allows the I/O queue to merge consecutive requests and send them to the device all at once, improving performance.
  • If you are using the device mapper for encryption (dm-crypt), then this patch might be of interest to you. It allows dm-crypt to scale to multiple CPUs by changing the crypto workqueue to be on a “per-CPU” basis. This also improves performance since the workload is spread across multiple CPUs.
  • The device mapper supports RAID-1, and support for discards was added to it. This means that TRIM is now supported in RAID-1 when using the device mapper (DM).
  • There are really two device tools in the kernel: Device Mapper (DM) and Multi-Device (MD). The kernel developers have been working to slowly bring these two capabilities closer together in the kernel, with the possibility of merging them in the future. In the 2.6.38 kernel, a new patch provides the skeleton for the DM target that will be the bridge from DM to MD. Initially this is for RAID-4, RAID-5, and RAID-6 (RAID456), with RAID-1 to follow later. Basically this patch is a way to use device-mapper interfaces to the MD RAID456 drivers.
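The benefit of batching in the sequential-write item is easier to see with a toy model. The Python sketch below is not kernel code and ignores everything the real block layer has to worry about; it only shows how a batch of requests, each a hypothetical (start_sector, sector_count) pair, collapses into fewer, larger requests once consecutive ones can be seen together.

```python
def merge_consecutive(requests):
    """Merge back-to-back requests in a batch.

    requests: iterable of (start_sector, sector_count) tuples.
    Returns a sorted list where adjacent requests have been combined.
    """
    merged = []
    for start, count in sorted(requests):
        if merged and merged[-1][0] + merged[-1][1] == start:
            # This request begins exactly where the previous one ends,
            # so extend the previous request instead of issuing a new one.
            prev_start, prev_count = merged[-1]
            merged[-1] = (prev_start, prev_count + count)
        else:
            merged.append((start, count))
    return merged
```

For instance, a batch of three back-to-back 8-sector writes plus one write elsewhere ends up as two requests instead of four, which is the effect the patch relies on to improve sequential throughput.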

As you can see, there is a great deal of work going on in the device mapper, bringing more performance (thank you on behalf of us performance challenged people) and more capability.


The 2.6.38 kernel is a very good one for storage people. We’re seeing performance improvements in a number of places including the VFS, the device mapper, ext2/3, and XFS. We’re seeing more support for SSDs through the addition of discard (TRIM) capability in file systems (2.6.38 added this capability to ext2/3). Also, compression is becoming more important for file systems that can use it, such as btrfs and SquashFS.

And finally, on one of my pet topics, we’re getting more capability to administer and control our storage systems. In the 2.6.38 kernel, the cgroup blkio capability was enhanced to allow hierarchical groups to be created, giving us more control over assigning I/O capabilities to processes on the system.

The 2.6.38 kernel was a very nice one for us storage-oriented Linux people. Let’s keep an eye on 2.6.39 and 2.6.40 for more development because I think we’ll see some more changes in the VFS.
