Metadata performance is perhaps the most neglected facet of storage performance. In previous articles we've looked into how best to improve metadata performance without too much luck. Could that be a function of the benchmark? Hmmm...
If you have been reading this column for any length of time, you will notice that a number of articles focus on metadata performance. The primary reason is that metadata performance is one of the most critical aspects of overall storage performance, yet also one of the most neglected. When people talk about drive or storage performance, it is almost invariably quoted in MB per second or something similar. In other words, throughput. What few people realize is how much metadata performance contributes to overall performance.
Metadata performance refers to how quickly files and directories can be created, removed, and checked for status (stat), along with other metadata operations. This aspect of storage performance is becoming more important because of the ever-increasing number of files and directories on systems. Creating files, deleting them, and performing status checks on them matter to more applications than ever before. There are applications that produce millions of files in a single directory and applications that create very deep and wide directory structures. As core counts increase, the number of files and directories grows with them, putting more and more pressure on the metadata performance of storage solutions.
In previous articles, a simple metadata benchmark, fdtree, was used to measure metadata performance. We saw that, many times, there was not a lot of change in performance when the file system’s metadata options were tuned. This led me to ask: is fdtree a good enough benchmark to reveal metadata performance differences as one tweaks a system? If you like analogies, it’s like wondering whether your doctor has the right diagnosis. Since I don’t always trust a single opinion, I like to get second and even third opinions. So in this article I want to examine a new metadata performance benchmark, run it on the same hardware as the past articles that used fdtree, and see whether both show the same trends.
A common benchmark used for HPC storage systems is called metarates. Metarates was developed by the University Corporation for Atmospheric Research (UCAR). It is an MPI application that tests metadata performance by using the following POSIX system calls:
- creat() – (open and possibly create a file)
- stat() – (get file status)
- unlink() – (delete a name and possibly the file it refers to)
- fsync() – (synchronize a file’s in-core state with storage device)
- close() – (close a file descriptor)
- utime() – (change file last access and modification times)
Using these system calls, the main analysis options for metarates are the following:
- Measure the rate of file creates/closes (create/close per second)
- Measure the rate of utime calls (utimes per second)
- Measure the rate of stat calls (stats per second)
Metarates has options for the number of files to write per MPI process (remember that an MPI application runs N processes, where N is a minimum of 1) and for whether the files are written to a single directory or to many directories. It also has the option of using the fsync() system call to synchronize each file’s in-core state with the storage device. But fsync has been something of a controversial subject in Linux, so let’s cover fsync() for a moment.
The fsync() system call is designed to flush buffer cache data to the storage device (e.g., disk) to make sure the data is actually on the disk and not sitting in a buffer, where a power loss would mean losing data. According to the eminent Theodore Ts’o, fsync() does the following:
"... the only safe way according that POSIX allows for requesting data written to a particular file descriptor be safely stored on stable storage is via the fsync() call. ..."
However, if you read the man pages for fsync() (“man 2 fsync”), you will see the following,
"Calling fsync() does not necessarily ensure that the entry in the directory containing the file has also reached disk. For that an explicit fsync() on a file descriptor for the directory is also needed."
So fsync appears to be the best way for an application to ensure that the data has actually reached the disk but you still have to be careful if you want to be 100% sure that the data is actually on the disk.
If you want to try a simple experiment, trace the I/O of your applications and look for calls to fsync(). I think you would be surprised at how many applications do not use it.
The test system used for these experiments was a stock CentOS 5.3 distribution but with a 2.6.30 kernel and e2fsprogs was upgraded to 1.41.9. The tests were run on the following system:
- GigaByte MAA78GM-US2H motherboard
- An AMD Phenom II X4 920 CPU
- 8GB of memory
- Linux 2.6.30 kernel
- The OS and boot drive are on an IBM DTLA-307020 (20GB drive at Ultra ATA/100)
- /home is on a Seagate ST1360827AS
- Two Seagate ST3500641AS-RK drives with a 16MB cache each; these are the drives used for testing

Only the first Seagate drive, /dev/sdb, was used for all of the tests. The second drive, /dev/sdc, was used only for the tests where the journal was placed on a separate drive.
As with previous articles, ext4 is used as the file system. The journal can be easily moved to a different device so that the impact of the journal location on performance can be determined. The steps for creating the file system and the journal are covered in previous articles.
Metarates Command Parameters
The test system has 4 cores, and because Metarates is an MPI application, we can choose the number of processes (cores) used during the run. For this benchmark, 1, 2, and 4 cores were used in three independent tests.
Keeping good benchmarking practice in mind, the run time (wall clock time) of each run should be greater than 60 seconds if possible. So the number of files was varied for 4 MPI processes until a run time of 60 seconds was reached. The resulting number of files, 1,000,000, was then fixed for all tests. It was also arbitrarily decided that all files would be written to the same directory, with the goal of really stressing metadata performance.
The final command line used for metarates for all three numbers of processors (1, 2, and 4) is the following.
time mpirun -machinefile ./machinefile -np 4 ./metarates -d junk -n 1000000 -C -U -S -u >> metarates_256MB_ramdisk.np_4.1.out
Here the “-np” option gives the number of processes (in this case, 4); “-machinefile” points to the list of hostnames of the systems to be used in the run (in this case, a file named “./machinefile” that contains the name of the test machine repeated four times, once for each process); and the results sent to stdout are redirected to the file “metarates_256MB_ramdisk.np_4.1.out”, an example of how the output files were named.
Notice that fsync() wasn’t used in the benchmark run, so that the results would be comparable to those of fdtree, which didn’t explicitly fsync() files. In addition, the directory where the files are written is the current directory where the binary (“./metarates”) is located. Also notice that three different performance measures are used:
- File create and close rate (how many per second)
- File stat rate (how many “stats” per second)
- File utime rate (how many “utimes” per second)
Next: Running Metarates