Our two prior articles detailed the performance results from a new patch, bcache, that uses SSDs to cache hard drives. We've looked at the throughput and IOPS performance of bcache and -- while it is still very new and under heavy development -- have found that in some cases it can help performance. This article examines the metadata performance of bcache in the hope of finding more areas where it can boost performance.
Introduction
In the last two articles I examined the performance of a new kernel patch called bcache. This patch is designed to use SSDs to cache block devices (hard drives or RAID arrays) with the goal of improving performance (and, really, who doesn’t like performance?). The first performance article found that there are some workloads where bcache helped throughput performance compared to a single disk, such as record rewrite, random write (not a lot of improvement, but it is noticeable), and strided read (some improvement).
From the second article we also saw that there are some IOPS workloads that can benefit from bcache. For example, sequential write IOPS (in the case of 128KB records) and sequential read IOPS (usually larger record sizes, particularly 128KB) both saw some reasonable improvements from bcache. But we also saw that bcache hurt random IOPS (both read and write).
While you may be disappointed with the performance of bcache, I don’t think the performance is bad. Remember that the patch is still very early in development, so much of the tuning that a patch normally goes through is still happening in earnest. As a parallel example, ext3 has been in the kernel for almost 9 years and there is still development work on it to improve performance (primarily for new workloads that people encounter). At this point, for a brand new patch that sits in the data path, just getting the basic features and functionality into the patch without it corrupting data is a major accomplishment. At the same time, some of the features that would help performance for some workloads, such as write-through caching, aren’t there (yet).
So cut the patch some slack, but don’t lose heart and don’t sweep bcache under the rug. Test, test, and ask for features! (Oh, and write patches if you are able.)
The first two articles showed that bcache improved throughput and IOPS performance compared to a single uncached disk for several specific workloads, but in some cases it was also worse than a single uncached disk. This article adds the third common performance measure, metadata performance, to our examination of bcache.
Examining Metadata Performance
Many people consider metadata performance to be the “unwanted relation” of storage performance. But for many workloads it can be one of the most important factors in determining overall performance.
As I mentioned in previous articles, I have quit saying “never” when it comes to applications and I/O patterns. Every time I think I have seen the epitome of strange I/O patterns, along comes a new one that does something even stranger. I don’t necessarily blame the application developers, because they are usually focused on solving a particular problem, not on ensuring that the I/O pattern works well with current storage. In fact, they may not even know what good practices are and are not (that is the subject of another article, for when I get really cranky).
If you read my byline you will learn that I work in the world of HPC (High Performance Computing). I recently encountered an application that wrote millions of very small files, either in a single directory or in thousands of subdirectories. Interestingly, the total size of the data was under 1TB. This kind of workload puts a great deal of pressure on the storage, particularly the file system. Even worse, while the application was running it would periodically do something fairly simple – “ls -l” – to determine whether the I/O was finished or whether a specific file had been created. It did this a very large number of times before moving to the next phase of computation. It’s pretty obvious that this application put the metadata performance of the file system under a great deal of pressure, to the point where it became the bottleneck in storage performance.
The moral of this tale is, “Don’t ignore metadata performance.” So in this article I’m going to examine metadata performance by using a benchmark called metarates.
Metarates
A common benchmark used for HPC storage systems is called metarates. Metarates was developed by the University Corporation for Atmospheric Research (UCAR). It is an MPI application that tests metadata performance by using the following POSIX system calls:
- creat() (open and possibly create a file)
- stat() (get file status)
- unlink() (delete a name and possibly the file it refers to)
- fsync() (synchronize a file’s in-core state with storage device)
- close() (close a file descriptor)
- utime() (change file last access and modification times)
Using these system calls, the main analysis options for metarates are the following:
- Measure the rate of file creates/closes (file creates/closes per second)
- Measure the rate of utime calls (utime operations per second)
- Measure the rate of stat calls (stat operations per second)
Metarates has options for the number of files to write per MPI process (remember that you will have N processes with an MPI application, where N is a minimum of 1) and for whether the files are written to a single directory or spread across many directories. It also has the option of using the system call fsync() to synchronize each file’s in-core state with the storage device.
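To make the access pattern concrete, below is a rough, single-process shell approximation of the kind of metadata load metarates generates. The directory name and file count are arbitrary, and coreutils commands stand in for the raw system calls (metarates issues creat(), stat(), utime(), and unlink() directly from C and coordinates N processes with MPI), so treat this as a sketch of the pattern rather than the benchmark itself.

mkdir -p junk

# Create and close a large number of small files (creat()/close()).
time for i in $(seq 1 100000); do
    : > junk/file.$i
done

# Update the timestamps of every file (roughly utime()).
time for i in $(seq 1 100000); do
    touch junk/file.$i
done

# Stat every file (stat()).
time for i in $(seq 1 100000); do
    stat junk/file.$i > /dev/null
done

# Delete every file (unlink()).
time find junk -type f -delete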
Metarates Command Parameters
Metarates is an MPI application, allowing us to choose the number of processes (cores) used during the run. So for this benchmark and this test system (more on that in a subsequent section), 1, 2, and 4 cores were used (three independent tests). These tests are labeled NP=1 (1 core), NP=2 (2 cores), and NP=4 (4 cores), where NP stands for Number of Processes.
Not forgetting our good benchmarking skills, the run time (wall clock time) of the runs should be greater than 60 seconds if possible. So the number of files was varied for 4 MPI processes until a run time of 60 seconds was reached. The resulting number of files was 1,000,000, and it was fixed for all tests. It was also arbitrarily decided to write all files to the same directory, with the goal of really stressing the metadata performance.
The final command line used for metarates for all three numbers of processes (1, 2, and 4) is the following:
time mpirun -machinefile ./machinefile -np 4 ./metarates -d junk -n 1000000 -C -U -S -u >> metarates_disk.np_4.1.out
where the “-np” option stands for number of processes (in this case 4), “-machinefile” refers to the list of hostnames of systems to be used in the run (in this case it is a file name “./machinefile” that contains the name of the test machine repeated 4 times – once for each process), and the results to stdout are sent to a file “metarates_disk.np_4.1.out” which is an example of how the output files were named.
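For completeness, here is one way the machinefile could be built and the three process counts run back to back. The hostname “testbox” is a placeholder (the actual machine name is not given here); the metarates options are the same as in the command above, and the output files follow the same naming pattern.

# Build a machinefile that repeats the (hypothetical) test machine's
# hostname once per process slot.
for i in 1 2 3 4; do echo testbox; done > ./machinefile

# Run metarates with 1, 2, and 4 processes, capturing each run's output.
for np in 1 2 4; do
    time mpirun -machinefile ./machinefile -np $np \
        ./metarates -d junk -n 1000000 -C -U -S -u \
        >> metarates_disk.np_${np}.1.out
done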
Notice that three different performance measures are used:
- File create and close rate (how many per second)
- File stat rate (how many “stat” operations per second)
- File utime rate (how many “utime” operations per second)
Test System
The tests were run on the same system as the previous tests. The highlights of the system are:
- GigaByte MAA78GM-US2H motherboard
- An AMD Phenom II X4 920 CPU
- 8GB of memory (DDR2-800)
- Linux 2.6.34 kernel (with bcache patches only)
- The OS and boot drive are on an IBM DTLA-307020 (20GB drive at Ultra ATA/100)
- /home is on a Seagate ST1360827AS
- There are two drives for testing. They are Seagate ST3500641AS-RK drives with a 16MB cache each. These are /dev/sdb and /dev/sdc.
Only the second Seagate drive, /dev/sdc, was used for the file system. Since the version of bcache I used could not yet cache a block partition, I used the whole device (/dev/sdc) for the file system.
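The exact file-system creation steps are described in the first article; purely as an illustration of using the whole device rather than a partition (and assuming ext4 and the mount point /mnt/test, which are placeholders of mine, not details taken from the first article), the setup could look something like this:

# Illustration only -- see the first article for the actual file system
# and mkfs options used. The file system is built on the whole device.
mkfs.ext4 /dev/sdc
mkdir -p /mnt/test
mount /dev/sdc /mnt/test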
There are four storage configurations:
- Single SATA II disk (7,200 rpm 500GB with 16MB cache)
- Single Intel X25-E SLC disk (64GB)
- Bcache combination that uses the Intel X25-E as a cache for the SATA drive and uses the CFQ (Completely Fair Queuing) IO scheduler, which is the default for most distributions
- Bcache combination that is the same as the previous one but uses the NOOP IO scheduler for the SSD, which many people think can help SSD performance (one way to switch schedulers is sketched just after this list).
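As a hedged example, here is one common way to check and change the scheduler from sysfs. The device name /dev/sdX is a placeholder, since the device name the X25-E appears under is not listed above.

# Show the available schedulers for the SSD; the bracketed entry is active.
cat /sys/block/sdX/queue/scheduler

# Switch the SSD to the NOOP scheduler (takes effect immediately).
echo noop > /sys/block/sdX/queue/scheduler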
To learn more about creating the file systems, please see the first article; there are several specific details about creating the file systems that can be important. The details of the benchmarks and tests run are in the next section. Once again, we will be using our good benchmarking techniques in this article.