If you blinked you might have missed the announcement of the new 2.6.34 kernel. Things have been happening very quickly around file systems and storage in the recent kernels so it's probably a good idea to review the kernels from 2.6.30 to 2.6.34 and see what developments have transpired.
Do You Feel That? It’s the Quickening!
Sorry for using an obviously “geeky” title but I haven’t seen many Highlander quotes in a while and the “quick” pace of Linux kernel development caused me to recall the movie. Regardless, when you are doing day-to-day work on your wonderful Linux system, it is easy to think that the pace of kernel development is somewhat slow. However, if you step back and watch the pace of development, it is truly remarkable, particularly around file systems and storage.
It feels like only yesterday that the 2.6.30 kernel came out (June 9, 2009 – almost a year ago) and we are already up to 2.6.34 (May 17, 2010) with 2.6.35 patches waiting at the gate like an excited thoroughbred waiting at the post. From 2.6.30 to 2.6.34 there has been a great deal of kernel development that impacts file systems and other aspects of Linux storage. Even if you are using a kernel from a distribution and not “rolling your own” it is important to understand what has been happening in these kernels and how it can impact your system(s). The reason is that new versions of popular distributions are coming out and you need to understand them. What kernel are they using? Does your distribution have write barriers turned on by default? Does your distribution allow LVM to support write barriers? Does your distribution still support the anticipatory IO scheduler? Inquiring minds want to know but more importantly, we all need to know.
Before we jump into the 2.6.30 kernel, let’s review the changes that came just prior to it.
There were actually quite a few significant storage changes in the kernels prior to 2.6.30 that affect the later kernels this article covers. The best place to start is probably with the 2.6.28 kernel.
In the 2.6.28 kernel, ext4 had the “experimental” label removed, declaring it stable. The label was actually removed on Oct. 11, 2008, but the 2.6.28 kernel didn’t come out until Dec. 25, 2008. People had been waiting for ext4 for some time because it increased the maximum file size and the maximum file system size, and improved performance, relative to the earlier members of the ext family of file systems. To review ext4 you can go back and read about it here. In 2.6.28 their wish was granted.
Then the 2.6.29 kernel popped out on March 23, 2009 and it too had some new developments around storage for Linux. On the file system front, btrfs and squashfs were added to the Linux kernel. Btrfs, as we are all probably aware, is the next “it” file system for Linux. It was added to the 2.6.29 kernel with the “experimental” label to help increase the amount of testing it receives as well as to ease the inclusion of patches, which could have gotten quite large had it been included in a later kernel version. As of the 2.6.34 kernel it is still under heavy development and is evolving rapidly.
As mentioned in the article about squashfs, it is primarily designed for embedded systems where space can be very important. Squashfs takes a given file system tree (it can be a subtree) and creates a very compressed image of that tree. Then you can mount that image in place of the tree, reducing the amount of space required but at the cost of the tree being read-only. If you combine squashfs and unionfs, you can actually create what appears to the user as a read-write tree (very cool stuff – give it a try).
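To make the create-then-mount workflow concrete, here is a minimal Python sketch that wraps the two command-line steps. It assumes the squashfs-tools package (which provides `mksquashfs`) is installed and that you have root privileges for the mount; the function names and any paths you pass in are purely illustrative.

```python
import subprocess

def squashfs_commands(tree, image, mountpoint):
    """Return the two commands: build a compressed image of `tree`,
    then loop-mount that image read-only at `mountpoint`."""
    build = ["mksquashfs", tree, image]
    mount = ["mount", "-t", "squashfs", "-o", "loop,ro", image, mountpoint]
    return build, mount

def build_and_mount(tree, image, mountpoint):
    """Run both commands in order; mounting normally requires root."""
    for cmd in squashfs_commands(tree, image, mountpoint):
        subprocess.run(cmd, check=True)  # raise if either step fails
```

Calling `squashfs_commands` on its own lets you inspect the commands before running anything, which is a reasonable habit for anything that touches mounts.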
The 2.6.29 kernel added a no-journal option to ext4 so you can run ext4 without a journal (Google is doing this). OCFS2 added metadata checksums to improve data reliability. And of course there are always updates to other file systems.
While you don’t find mention of it on the kernel newbies site, one very important addition in the 2.6.29 kernel was that all write barriers will be respected by LVM. Prior to that kernel write barriers were ignored by LVM. While you may complain that write barriers impact performance, and they do, they can also save your bacon in regard to file system corruption.
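If you are curious whether barriers have been explicitly switched on or off for your mounted file systems, a rough Python sketch like the one below can scan the mount options. It only looks for the `barrier`/`nobarrier` style options that ext3, ext4, and XFS accepted in kernels of this era; it says nothing about whether LVM passes barriers through, and when no option is listed you get whatever the file system's default is. The function names are my own.

```python
def barrier_setting(mount_options):
    """Classify explicit barrier options in a comma-separated option string."""
    opts = mount_options.split(",")
    if "nobarrier" in opts or "barrier=0" in opts:
        return "disabled"
    if "barrier" in opts or "barrier=1" in opts:
        return "enabled"
    return "filesystem default"  # no explicit option was given

def scan_mounts(path="/proc/mounts"):
    """Map each mount point to its explicit barrier setting."""
    results = {}
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) >= 4:  # device, mount point, fs type, options, ...
                results[fields[1]] = barrier_setting(fields[3])
    return results
```

On a Linux box, `scan_mounts()` returns a dictionary you can print or filter; anything reported as "disabled" deserves a second look.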
Now that the stage has been set, let’s start reviewing kernels, starting with the 2.6.30 kernel (aka “chock-full-o-filesystems”).
2.6.30 – Wow!
The 2.6.30 kernel was loaded with new file systems and other file system developments. NILFS2, pohmelfs, and exofs were all added to the kernel in this version. Preliminary support for NFS 4.1 (aka pNFS) was also added. But there were some other changes that are worth discussing so read on.
NILFS2 is a different type of file system which is termed a log-structured file system. You can read a summary of it here. Rather than write to a tree structure such as a b-tree or an h-tree, either with or without a journal, a log-structured file system writes all data and metadata sequentially in a continuous stream that is called a log (actually it is a circular log). Because of this design it is very easy for NILFS2 to create snapshots and mount them alongside the file system itself. But one of the more desirable features of NILFS2 is performance.
The design of log-structured file systems such as NILFS2 means that they can perform very well on SSD storage devices. (yeah – performance!) An additional cool result of the log design is that a log-structured file system recovers from a crash extremely fast, and the amount of time is independent of the size of the file system.
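A toy model can help show why snapshots and fast recovery fall out of the log design. The Python sketch below is not how NILFS2 is implemented; it is an in-memory illustration with made-up names that ignores the circular log and garbage collection. Every write is appended, a checkpoint simply remembers a position in the log plus the index at that moment, and recovery replays only the records written after the last checkpoint.

```python
class ToyLogFS:
    """Append-only log: writes never overwrite earlier records."""

    def __init__(self):
        self.log = []          # records in write order: (path, data)
        self.index = {}        # path -> position of its newest record
        self.checkpoints = []  # (log length, copy of index) pairs

    def write(self, path, data):
        self.index[path] = len(self.log)
        self.log.append((path, data))

    def read(self, path):
        return self.log[self.index[path]][1]

    def checkpoint(self):
        """Remember the current state; returns a checkpoint id."""
        self.checkpoints.append((len(self.log), dict(self.index)))
        return len(self.checkpoints) - 1

    def snapshot(self, cp_id):
        """Read-only view of the file system as of a checkpoint."""
        _, index = self.checkpoints[cp_id]
        return {path: self.log[pos][1] for path, pos in index.items()}

    def recover(self):
        """Rebuild the index after a crash by replaying only the
        records written since the last checkpoint."""
        start, index = self.checkpoints[-1] if self.checkpoints else (0, {})
        index = dict(index)
        for pos in range(start, len(self.log)):
            index[self.log[pos][0]] = pos
        self.index = index
        return len(self.log) - start  # number of records replayed
```

The work done by `recover` depends on how much was written since the last checkpoint, not on how large the log has grown, which is the intuition behind the recovery claim above.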
Pohmelfs (Parallel Optimized Host Message Exchange Layered File System) is a parallel distributed file system designed to improve upon the performance of NFS. In the 2.6.30 kernel it was added to the “staging” area so it’s not considered stable at this time. Its parallel design gives it the potential for better performance, and it also has a number of features aimed at performance, such as local caching.
Exofs is an object-based file system, which is, I believe, the first to be added to the Linux kernel. Object-based file systems are a third option alongside block-based and file-based storage. The concept is to take the data, add some metadata, and then let the storage hardware handle where the combination is placed. This means that the operating system just interfaces with the objects and not the devices. Note that the devices need to conform to the T10 OSD standard SCSI command set (OSD = Object Storage Device).
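As a rough illustration of the idea (and nothing more), the Python sketch below models a store where the caller hands over data plus metadata and gets back an object id, while the store itself decides which “device” holds the object. This is not exofs and not the T10 OSD command set; the class, its methods, and the placement policy are all invented for the example.

```python
import itertools

class ToyObjectStore:
    """Callers deal in object ids; placement is the store's business."""

    def __init__(self, num_devices=2):
        self._devices = [dict() for _ in range(num_devices)]
        self._next_id = itertools.count(1)
        self._placement = {}  # object id -> device number

    def put(self, data, **metadata):
        """Store data with its metadata and return a new object id."""
        oid = next(self._next_id)
        device = oid % len(self._devices)  # placement policy hidden from caller
        self._devices[device][oid] = (data, metadata)
        self._placement[oid] = device
        return oid

    def get(self, oid):
        """Return (data, metadata) for an object id."""
        return self._devices[self._placement[oid]][oid]
```

Notice that nothing in `put` or `get` exposes a block address or a device name, which is the point: the caller never needs to know where the object lives.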
Object-based file systems hold a great deal of promise for storage. There are many reasons why and the list far exceeds the confines of this column, but there is a great deal of excitement around them for solving large scale storage problems and making storage easier. Exofs may be the first object-based file system to be added to the kernel but hopefully it is not the last.
Another important development in 2.6.30 was the inclusion of preliminary developer support for NFS v4.1. While it sounds like a minor version update from NFS v4.0, it is actually a huge change because it adds support for pNFS, or Parallel NFS. (yeah – performance!) Recall that NFS is the only standard network file system, but it has performance issues. pNFS takes NFS to the next level and creates a parallel distributed file system with the promise of improving performance. Keep an eye on NFS v4.1 since it is the only standard parallel distributed file system.
So far the primary theme of the 2.6.30 kernel is new file systems. However there are other developments that are very important. One of them is the inclusion of a client-side caching system for network file systems such as AFS and NFS. FS-Cache is the interface between the file system and the cache, allowing the file system to be cache agnostic. CacheFS itself is the caching back end for FS-Cache.
Using FS-Cache and CacheFS with something like NFS can be effective for some workloads. As described in an earlier article, the key thing is that the data has to already be in the cache on the client. For example, the data has to be created on the client or copied to the client, which puts it into the client cache. Then if the data is accessed again it will be served from the cache rather than the server.
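The behavior described above is essentially a read-through cache. The Python sketch below is a toy model, not the FS-Cache API: the classes are invented for illustration, and it ignores cache invalidation and coherency entirely. It just shows that only the first read of a file reaches the server.

```python
class ToyServer:
    """Stand-in for a file server that counts how often it is asked."""

    def __init__(self, files):
        self.files = dict(files)
        self.reads = 0

    def read(self, path):
        self.reads += 1
        return self.files[path]

class CachingClient:
    """Client that keeps a local copy of anything it has read."""

    def __init__(self, server):
        self.server = server
        self.cache = {}

    def read(self, path):
        if path not in self.cache:  # cache miss: go to the server
            self.cache[path] = self.server.read(path)
        return self.cache[path]     # cache hit: no network trip
```

A workload that reads each file once gains nothing from this, which is why the benefit depends so heavily on the access pattern.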
Around the time of the 2.6.29 release there erupted a great discussion (argument) around the use of fsync and the possibility of getting a file zeroed. The arguments became quite fierce but as a result there were a few changes to file systems and to the 2.6.30 kernel. Please read about the changes since they can affect performance and data integrity.
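For application writers, the practical takeaway from that debate is usually some version of “write to a temporary file, fsync it, then rename it over the original.” The Python sketch below shows one common form of that pattern. It is an illustration under assumptions, not a description of the kernel changes themselves: the function name is mine, and the directory fsync at the end assumes Linux behavior.

```python
import os
import tempfile

def atomic_write(path, data):
    """Replace `path` with `data` so a crash leaves either the old
    contents or the new contents, not an empty file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)  # same directory, same file system
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())   # push the new data to stable storage
        os.replace(tmp, path)      # rename over the original
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)         # clean up the temporary file on failure
        raise
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)           # make the rename itself durable
    finally:
        os.close(dir_fd)
```

The extra fsync calls are exactly where the performance cost comes from, so this is a trade-off between speed and data integrity rather than a free lunch.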
A small, somewhat unnoticed change in the 2.6.30 kernel is the ability to support lzma compressed kernel images. This can be important to storage because there is a version of squashfs that supports lzma and now lzma is in the kernel (anyone see the connection?). If you create tree images using this version of squashfs, you will get greater compression, saving more space. (yeah – capacity!)
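If you want a feel for the difference on your own data, the small Python sketch below compares zlib (the gzip-style compression) against LZMA using the standard library. It is only a rough proxy: Python's compressors are not the kernel's or squashfs's code, the results depend heavily on the data, and the function name is my own.

```python
import lzma
import zlib

def compressed_sizes(data):
    """Return the raw size and the zlib and LZMA compressed sizes in bytes."""
    return {
        "raw": len(data),
        "zlib": len(zlib.compress(data, 9)),
        # FORMAT_ALONE selects the legacy .lzma container
        "lzma": len(lzma.compress(data, format=lzma.FORMAT_ALONE, preset=9)),
    }
```

Try it on a tarball of the tree you plan to squash; the gap between the two numbers gives a hint of what you might save.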
Whew! That’s a great deal of new stuff for storage in one kernel version. Things slowed a bit with the release of 2.6.31 but that doesn’t mean there weren’t any changes worth mentioning.