The last article talked about the anatomy of SSDs and the origins of some of their characteristics. In this article, we break down tuning storage and file systems for SSDs with an eye toward improving performance and helping overcome some of the platform's limitations.
Another option for improving the performance of SSDs, or perhaps better said, taking advantage of an SSD's characteristics, is to tune the file system. This can mean choosing the parameters of the file system when it is built and/or mounted, or even choosing the file system itself. As with choosing the IO scheduler and possibly tuning it, the performance results achieved by file system tuning will depend upon the workload.
Remember that one of the more important aspects of SSDs is the block erase size. This is the amount of space that must be erased if any data in that region has to be changed. For many SSDs the typical block erase size is 512KB. Consequently, there have been some suggestions about aligning partitions to this size.
One of the most suggested tuning tips is to align the partition with the erase block size of the SSD. This means you have to use fdisk to create the correct partitioning. There are a couple of very good postings about this subject. Theodore Ts’o recently wrote a blog post about aligning partitions and file systems to block erase boundaries. There is also a very good posting on the OCZ forum that talks about partition alignment.
Aligning the partition to the erase block size is not easy. The “fdisk” tool in Linux thinks in terms of cylinders, heads, and sectors per track. You have to translate these concepts into KB because that’s how the block erase size is expressed. Let’s assume that the alignment is based on 512KB boundaries to match the block erase size. To calculate the block size, use the following equation:
Block Size (in KB) = [ (number of heads) * (number of sectors/tracks) * (number of bytes per sector) ] / 1024
The number of sectors per cylinder, a common measure of the partition, is given by the following:
Sectors per Cylinder = [ (number of heads) * (number of sectors/tracks)]
For this example of 512KB boundaries we could choose 32 heads and 32 sectors/tracks. This results in the following:
Block Size (in KB) = [ (number of heads) * (number of sectors/tracks) * (number of bytes per sector) ] / 1024
Block Size (in KB) = [ 32 * 32 * 512 ] /1024
Block Size (in KB) = 512KB
This link lists some head and cylinder combinations to achieve certain boundaries. To make sure that the boundaries are met by the partition, start on the second cylinder as shown in this posting.
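The geometry arithmetic above, along with a check that a partition's starting sector falls on an erase block boundary, can be sketched in shell. The function names are hypothetical, and the 512-byte sector and 512KB erase block are the example values from the text, not values read from a real drive.

```shell
# Size of one cylinder in KB for a given geometry.
# Arguments: heads, sectors per track, bytes per sector.
cylinder_kb() {
    echo $(( $1 * $2 * $3 / 1024 ))
}

# Number of sectors in one cylinder.
# Arguments: heads, sectors per track.
sectors_per_cylinder() {
    echo $(( $1 * $2 ))
}

# Succeeds if a starting sector lies on an erase block boundary.
# Arguments: start sector, bytes per sector, erase block size in KB.
is_aligned() {
    [ $(( ($1 * $2) % ($3 * 1024) )) -eq 0 ]
}

cylinder_kb 32 32 512            # prints 512
sectors_per_cylinder 32 32       # prints 1024

# With 1024 sectors per cylinder, the second cylinder begins at sector 1024.
is_aligned 1024 512 512 && echo "sector 1024: aligned"

# The traditional DOS starting sector of 63 does not land on a boundary.
is_aligned 63 512 512 || echo "sector 63: not aligned"
```

Trying other head and sector counts in cylinder_kb is a quick way to see which geometries give a cylinder size that matches a particular erase block size.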
In a blog post, Theodore Ts’o aligned the partition on 128KB boundaries and also gave a very good discussion of the interactions with LVM and ultimately with the file system, assuming it’s an ext type file system. This includes how to handle the file system journal, which impacts the layout of the file system. All of these considerations can drive the partition alignment, how LVM is configured, and how the file system is created.
Let’s assume you are aligning on a 512KB block erase size and that you are using 4KB block sizes in your file system. To take advantage of the alignment, you need to specify the stripe width of the file system. For example, in ext4, this can be done with the following command.
# mke2fs -t ext4 -E stripe-width=128 /dev/sdd
…where /dev/sdd is used as an example path to the SSD device. Note that the reason the stripe width of 128 was chosen is so that the block size multiplied by the stripe width results in the block erase boundary size. This also works for ext2 and ext3 file systems. Other file systems usually have a similar option.
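As a small sketch of where the 128 comes from, the stripe width can be computed from the erase block size and the file system block size. The function name is hypothetical, /dev/sdd is the example device from the text, and the -b 4096 option is added here only to make the 4KB block size assumption explicit. The command is printed rather than run because creating a file system destroys any data on the device.

```shell
# Stripe width in file system blocks: erase block size / file system block size.
# Arguments: erase block size in KB, file system block size in KB.
stripe_width() {
    echo $(( $1 / $2 ))
}

DEVICE=/dev/sdd                  # example device path from the text
WIDTH=$(stripe_width 512 4)      # 512KB erase block, 4KB blocks: 128

# Print the command instead of executing it (mke2fs overwrites the device).
echo "mke2fs -t ext4 -b 4096 -E stripe-width=$WIDTH $DEVICE"
```

If the alignment were on 128KB boundaries instead, stripe_width 128 4 would give 32.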
However, as Theodore points out, there may not be as much reason to align the partition on the block erase boundaries with the latest SSDs. In another excellent blog post he discusses some testing he did with ext2 and ext3 on a fairly recent SSD. He found that the journaling option added about 4% – 6% to the time for his workload.
Overall, the jury is still out on the efficacy of partition alignment, and the benefit likely depends on the workload. But in the short term it does not hurt anything.
It is amazing how many articles there are about using the noatime option with file systems. This option tells the file system not to update the access time of the file, where access time is, logically enough, the date when the file was accessed (perhaps opened and read). When a file is accessed, the metadata for the file is modified to reflect the new access time (atime). The file could be terabytes in size, but a very small amount of the metadata has to be updated when the file is accessed at any level (1 byte or the entire 1 TB).
The noatime option is typically used with the mount command or in /etc/fstab to tell the file system not to change the access time on existing files. Naturally, new files have their access time set since the file is new.
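As a rough illustration, the comments below show what a noatime entry in /etc/fstab and a remount command might look like, followed by a small check that lists which mounted file systems already use the option. The device name, mount point, and function name are hypothetical.

```shell
# Example /etc/fstab entry (device and mount point are hypothetical):
#   /dev/sdd1  /home  ext4  defaults,noatime  0  2
#
# Example of switching an already mounted file system (needs root):
#   mount -o remount,noatime /home

# Succeeds if a comma-separated mount option string includes noatime.
has_noatime() {
    case ",$1," in
        *,noatime,*) return 0 ;;
        *) return 1 ;;
    esac
}

# List the mount points currently mounted with noatime (Linux only).
if [ -r /proc/mounts ]; then
    while read -r dev mnt fstype opts rest; do
        if has_noatime "$opts"; then
            echo "$mnt"
        fi
    done < /proc/mounts
fi
```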
Many people advocate using the noatime option because it avoids some very small writes, helping overall performance. A counterpoint to this argument is that the access time can provide very useful information. Fundamentally, it tells you when the file was last used. If it’s been a long time since the file was used (insert your own definition of “long time”), then perhaps it’s time for the file to be archived, saving space, including backup space.
Whether or not you use the noatime option is up to you. The author highly recommends using it where it makes sense. For example, for home directories, tracking access times might not make sense because you have web browsers and email tools with a huge number of small files that are accessed fairly frequently, including web cache files. Accessing small files and then updating the metadata, especially for cache files, is a rather pointless exercise that just adds load to the file system and the underlying storage. However, for project data (work data), access times can be exceedingly important. This difference lends itself to using two different mount points and “forcing” users to put their work data in a different mount point.
How does this impact SSDs? The answer should be fairly obvious: you strive to avoid very small rewrites of data on SSDs because each one causes a complete rewrite of the entire block. So if the access time of a file is updated, then the entire 512KB block where the metadata resides has to be copied, the updated access time merged into the copy, the SSD block erased, and the updated data written to the block. This can not only take a great deal of time, but you’ve also used a write cycle just to update a teeny tiny portion of the 512KB block. Seems like a waste, doesn’t it? Imagine doing that for simple cache data for Firefox. It sure makes the concept of access time seem pointless.
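To put a rough number on that waste, here is a back-of-the-envelope sketch under the simplified model just described, where any change rewrites a whole erase block. The function name is hypothetical, and the 4-byte size of the access time field is an assumption made for illustration. Real drive controllers may buffer or remap writes, so treat the result as an upper bound for this model rather than a measurement.

```shell
# Bytes physically rewritten per byte logically changed, assuming any change
# forces a rewrite of one whole erase block.
# Arguments: erase block size in KB, number of bytes actually changed.
rewrite_ratio() {
    echo $(( $1 * 1024 / $2 ))
}

# Assumption: the access time occupies 4 bytes of on-disk metadata.
rewrite_ratio 512 4      # prints 131072
```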
Theodore Ts’o has a blog where he presents some evidence for turning off access time (noatime). It helps with the workload times and it helps reduce the number of write cycles.
However, remember that in some cases the access time can be extremely important. So the point is to judiciously select when you use atime and when you don’t.
File System Choice
Another option for improving the performance of SSDs is the choice of a file system. An early rule of thumb was not to use a file system with a journal because of the concern of wearing out a portion of the SSD too quickly, particularly because of the write amplification problem. However, as write amplification issues were addressed, the need for journal-less file systems diminished. Theodore Ts’o, there’s that name again, discusses this in his blog. However, there is still some choice in file systems.
As discussed earlier, researchers from Texas A&M found that random write performance of SSDs can suffer severely as the record size decreases. So in choosing a file system, it might be a good idea to find one that does fairly well on random write performance. This points to the possibility of a log based file system such as NILFS.
NILFS is a log-based file system that was discussed in a previous article. Recall that a log based file system uses a simple log or circular buffer to handle all data and metadata. This means that there is virtually no over-writing of data. New data is just pushed to the back of the log while the file system writes the head of the log moving sequentially through the log. If data is changed, the new data is pushed to the log and the old data is erased as part of a garbage collection operation in the file system.
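The append-only idea can be illustrated with a toy model in shell. This is only a sketch of the concept, not how NILFS is implemented: the "log" is a temporary text file, each write appends a "key value" record, reads take the newest record for a key, and garbage collection drops stale records. All function names are hypothetical, and awk is assumed to be available.

```shell
# The log lives in a temporary file; nothing in it is overwritten in place.
LOG=$(mktemp)

# Append a new version of a key to the end of the log.
log_write() {
    printf '%s %s\n' "$1" "$2" >> "$LOG"
}

# Read the newest value for a key: the last matching record wins.
log_read() {
    awk -v k="$1" '$1 == k { v = $2 } END { print v }' "$LOG"
}

# Garbage collection: keep only the newest record for each key.
log_gc() {
    awk '{ last[$1] = NR; key[NR] = $1; line[NR] = $0 }
         END { for (i = 1; i <= NR; i++) if (last[key[i]] == i) print line[i] }' \
        "$LOG" > "$LOG.new"
    mv "$LOG.new" "$LOG"
}

log_write fileA v1
log_write fileB v1
log_write fileA v2       # the old fileA record stays in the log for now
log_read fileA           # prints v2
log_gc                   # the stale "fileA v1" record is dropped
wc -l < "$LOG"           # counts the two records that remain
rm -f "$LOG"
```

Every write in this model is a sequential append, which is the property that makes log-based designs attractive for devices that handle small random writes poorly.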
NILFS has great potential for exploiting SSDs for performance. In Feb. 2008, there was a presentation by Dongjun Shin from Samsung as part of the Linux Storage & File System Workshop 2008 (LSF ’08). He benchmarked NILFS, Btrfs, Ext2, Ext3, Ext4, ReiserFS, and XFS when running on an SSD device. Granted that the testing is a little old, but the results are very, very exciting. The benchmark, Postmark, simulates an email server. Two groups of file sizes were tested: (1) 9 – 15KB (S), and (2) 0.1 – 3MB (L). For each group, two tests were run: one with a small number of files (S) and one with a larger number of files (L). Figures 1 and 2 below are the test results.
Figure 1: Postmark Results for Small File Size
Figure 2: Postmark Results for Large File Size
Notice that in both cases, the performance of NILFS exceeds that of other file systems. For small files NILFS was about 25-38% faster than the nearest competitor (btrfs). For large files NILFS was about 15-25% faster than the nearest competitor (reiserfs and/or ext4).
Recently, Valerie Aurora wrote about SSDs and log structured file systems. The article is very good and worth reading to get into more details about why something like NILFS can be a big boost for your SSD.
Other File Systems
You can use any of the usual Linux file systems (ext2/3/4, jfs, xfs, reiserfs, etc.) as you wish for your SSDs. Phoronix published some benchmarks on an Intel SSD for various Linux file systems and workloads. The benchmarks were only run once and the details of the tests weren’t published, which ignores good benchmarking habits (although you can download their benchmarks since they are open-source). But the results are relatively interesting nonetheless.
In looking at the results, the read performance for the various file systems tested (ReiserFS, JFS, XFS, ext, and ext4) is about the same except for JFS which seems to be a little slower. The write performance for the various tests is also pretty much the same (except the 4GB read test) until you hit the IOzone benchmarks where xfs and ext4 pull ahead of the others with xfs slightly ahead of ext4. For the Intel IOmeter tests, all of the file systems with the exception of jfs and xfs did very well.
While the Phoronix benchmarks didn’t follow the good habits of benchmarking and it is impossible to tell what kind of variation exists in the tests, they do point out that there are some fairly significant differences in file system performance on SSDs.
File systems are starting to gain some SSD optimizations to help SSDs perform better but only when it makes sense from both the perspective of the file system and the SSD. An easy example is btrfs which has some beginnings of SSD support that know about blocks.
The esteemed Theodore Ts’o wrote a blog discussing whether it’s worth optimizing file systems for SSDs. While Linus has said that he views it as a dumb idea, Theodore thinks it might have legs in some cases. He does make some valid points in that “dumb” SSDs (i.e. not Intel SSDs) are much cheaper than “smart” SSDs (about half the price) so it might justify the development of file systems that understand SSDs and know how to take advantage of the attributes of the drives and accommodate for their weaknesses so that “dumb” SSDs can be used.
However, in general, at this time, choosing one file system over another for SSDs is rather difficult. NILFS bears some examination but among the others it’s impossible to say that one is better than the others on SSDs. As with everything in the storage world, the best solution depends upon your workload.
Concept for SSD utilization
Some people have tested SSDs on their workloads and sometimes they have been less than happy with the results. In some cases it might be possible to get better performance from an SSD compared to a hard drive, but there are cases where the SSD didn’t perform as well (sometimes this also depends upon the SSD). Moreover, there are cases where the SSD did perform better than a normal hard drive but not enough to justify the cost of the SSD. However, there might be methods for combining SSDs with regular spinning media to achieve the performance you need or require for your application.
SquashFS and UnionFS
A previous article talked about combining SquashFS, which is a compressed read-only file system, with a read/write file system, such as ext2/3/4, using UnionFS to achieve what the user thinks is a single file system. The advantage of using SSDs in such a combination is that they have amazing read performance.
The idea is to take a “snapshot” of a user’s home directory and create a compressed image of it using SquashFS. Then this image is stored on an SSD, since the image is read-only, and it is mounted on the system. Any read access to the data in the image will use the huge read performance boost of the SSDs. The image on the SSD is combined with a hard drive that stores any new data (writes). If a user updates any data in the SquashFS file system, the data is changed and put on the hard drive so that the user only sees the new data, not the old data in the SquashFS image. Then later, you can add the changed data from the hard drive to the previous image to capture all of the changes. This combination should work better for workloads that have a reasonable amount of read IO.
Administratively it will take some work to construct this combination. But it’s fairly easy to create a script that locks a user’s account, creates the SquashFS snapshot, and remounts the combination.
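A dry-run sketch of such a script might look like the following. It only prints the commands, since they need root and real devices. The user name, every path, and the function name are hypothetical, and the unionfs mount options follow the dirs= branch syntax of the Unionfs kernel module, which should be checked against the version installed on your system. Note that usermod -L only locks the account's password.

```shell
# Print the sequence of commands for snapshotting one user's home directory.
# Argument: user name. The body runs in a subshell so variables stay local.
snapshot_commands() (
    user=$1
    home=/home/$user                 # directory to snapshot
    image=/ssd/$user.sqsh            # SquashFS image stored on the SSD
    ro=/mnt/ro/$user                 # mount point for the read-only image
    rw=/hdd/rw/$user                 # writable branch on the hard drive

    echo "usermod -L $user"                            # lock the account
    echo "mksquashfs $home $image"                     # build the compressed image
    echo "mount -t squashfs -o loop,ro $image $ro"     # mount the image read-only
    echo "mount -t unionfs -o dirs=$rw=rw:$ro=ro unionfs $home"   # combine branches
    echo "usermod -U $user"                            # unlock the account
)

snapshot_commands alice
```

Printing the commands first makes it easy to review the paths before anything touches a real account.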
Alternatively, you could track the access times for the various files in a user’s account. If they are of a certain age, then you could copy the files to a new directory within the user’s account, perhaps called “ssd”, keeping the same directory structure but using a new root for the tree. Then symlink the files in the “ssd” directory to the original location. Then you can easily just create a SquashFS image of the “ssd” subdirectory and combine with the user’s /home. If the “ssd” directory isn’t too large then you can even share the SSD among users. But, well you get the idea.
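Finding the files of a certain age can be sketched with find and its -atime test. The function name and the 90-day threshold are hypothetical, and the approach only works if the file system still records access times, which means it is not mounted with noatime. The demonstration uses a temporary directory rather than a real home directory.

```shell
# List regular files under a directory not accessed for more than N days.
# Arguments: directory, age in days.
old_files() {
    find "$1" -type f -atime +"$2"
}

# Small demonstration in a scratch directory.
demo=$(mktemp -d)
touch -a -t 200001010000 "$demo/stale.dat"   # access time set to Jan 1, 2000
touch "$demo/fresh.dat"                      # access time is now
old_files "$demo" 90                         # lists only stale.dat
rm -rf "$demo"
```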
SSDs have some unusual properties, some of which are very good and some of which make things challenging. They potentially have great performance, particularly read performance, are very shock-resistant, and can sometimes use less power. But at the same time, they have asymmetric performance (reads are much faster than writes) and they have a limited number of rewrite cycles before they can no longer retain data.
In this article, concepts around tuning the layers in Linux from the application to the SSD itself are explored to see if there are ways they can be tuned to improve SSD performance. Changing the IO scheduler as well as partitioning the drive for block erase alignment were explored with some evidence that both concepts can help improve performance.
Using the “noatime” option with the file system was also explored. Theodore Ts’o has shown some evidence that this option can improve performance for SSDs as it does for hard drives. However, it is highly recommended that you consider using separate mount points so that access times can be used where they are very valuable (i.e. statistics on file access: when was the last time a file was touched?).
There was also some discussion around the choice of the file system itself. While there isn’t anything evident in the usual file system suspects to make one stand out over the others for use on SSDs, there is one that warrants a second look – NILFS. NILFS is a log based file system which can help with random write performance with which SSDs have a very difficult time. Since a log based file system writes all data, even metadata, sequentially, performance on an SSD should be good. A paper from Dongjun Shin showing just how good the performance can be was presented, indicating that NILFS is worth investigating.
Finally, a concept of using SSDs for storing read-only SquashFS images combined with hard drives for write data using UnionFS was presented. This concept plays to the strengths of each storage medium: SSDs for reads, and hard drives for random writes. It is fairly easy to construct scripts to create SquashFS images within users’ accounts that can be updated if the data in the image is changed.
SSDs are a great disruptive technology that is forcing people to rethink storing data (always a good thing) but they are not without limitations. Hopefully this article has shown that there are ways to tune your systems to better utilize them.
Jeff Layton is an Enterprise Technologist for HPC at Dell. He can be found lounging around at a nearby Frys enjoying the coffee and waiting for sales (but never during working hours).