Benchmarking has become synonymous with marketeering to the point it is almost useless. This article takes a look at a very important paper that can demonstrate how bad it has become and makes recommendations on how to improve the situation.
“There are lies, damn lies, and then benchmarks.” It’s an overused phrase but it does make a point: Benchmarks have become so abused that no longer are they used to provide useful information for making decisions or for improving solutions. While this is something we seem to inherently know, it was recently backed up with solid data. The cliché, it turns out, is on the mark.
Recently a paper was published in Byte and Switch that examined nine years of storage and file system benchmarking. It’s an important paper. In this article we will summarise the paper’s findings. Looking ahead we plan to use them as the basis of future benchmarking results published in this column.
Yes, it’s that good.
Benchmarking is Just Another Word for Marketeering
Benchmarking is an activity intended to provide information about how fast a piece of software and/or hardware runs. Since this article is about file systems and storage, benchmarking means how fast the file system and/or storage solution performs IO operations.
Benchmarks can be very useful in helping present the benefits or problems of a system. However, rather than just present a simple table or graph of results and declare that one system is better than another, what is truly needed is a discussion of the tests and the systems, the reason behind the benchmarks, and a clear explanation of the results and implications. In particular the person reading the benchmarks should be able to verify the benchmark results themselves and then be able to compare the performance of one system to another.
To achieve these objectives the benchmarks must be well thought out. That means choosing suitable benchmarks, choosing suitable hardware and configurations, and then reporting results that are as accurate as possible.
However, over time benchmarks have been reduced to single graphs or tables with little or no explanation of them. In some cases the benchmarks are published naively and perhaps without adequate explanation (the author is guilty of this) sometimes in the interest of expediency. But in other cases benchmarks are published with little or no information about what was done solely to promote or detract from a particular product. In other words, benchmarks have become marketing material instead of usable information.
Nine-Year Review of Benchmarks
Recently there was a paper published by Avishay Traeger and Erez Zadok from Stony Brook University and Nikolai Joukov and Charles P. Wright from the IBM T.J. Watson Research Center entitled, “A Nine Year Study of File System and Storage Benchmarking” (Note: a summary of the paper can be found at this link). The paper examines 415 file system and storage benchmarks from 106 recent papers. Based on this examination the paper makes some very interesting observations and conclusions that are, in many ways, very critical of the way “research” papers have been written about storage and file systems. These results matter to anyone who wants to benchmark well. The authors also make recommendations on how to perform good benchmarks (or at the very minimum, “better” benchmarks).
The research included papers from the Symposium on Operating Systems Principles (SOSP), the Symposium on Operating Systems Design and Implementation (OSDI), the USENIX Conference on File and Storage Technologies (FAST), and the USENIX Annual Technical Conference (USENIX). The conferences range from 1999 through 2007. The criteria for the selection of papers were fairly involved but focused on papers of good quality whose benchmarks measured performance, not correctness or capacity. Of the 106 papers surveyed, the researchers included 8 of their own.
When selecting the papers, they used two underlying themes or guidelines for evaluation:
- Looking to see if the authors explained exactly what was done – providing details on the benchmarking process.
- Finding out if the authors didn’t just explain what was done, but also justified why it was done in that particular fashion. For example, explaining why comparing file systems is fair or why a particular benchmark was run.
Breaking Down Good Benchmarks
Repetition One of the simplest things that can be done for a benchmark is to run the benchmark a number of times and report the median or average. In addition, it would be extremely easy (and helpful) to report some measure of the spread of the data such as a standard deviation. This allows the reader to get an idea of what kind of variation they could see if they tried to reproduce the results and it also allows readers to understand the overall performance over a period of time.
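To make the point concrete, here is a minimal sketch in Python of running a workload several times and reporting the mean, median, and standard deviation. The function names (`time_runs`, `summarize`, `example_workload`) are illustrative choices of ours, not anything from the paper, and the placeholder workload does no IO at all; swap in the real test before drawing any conclusions.

```python
import statistics
import time


def time_runs(workload, runs=10):
    """Call workload() `runs` times and return the elapsed seconds of each call."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        workload()
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples):
    """Return the central values and the spread of a list of timings."""
    mean = statistics.mean(samples)
    # The sample standard deviation needs at least two data points.
    stdev = statistics.stdev(samples) if len(samples) > 1 else 0.0
    return {
        "runs": len(samples),
        "mean": mean,
        "median": statistics.median(samples),
        "stdev": stdev,
        # Relative spread: near or above 1.0 means the noise is as big as the result.
        "cv": stdev / mean if mean else float("nan"),
    }


def example_workload():
    # Hypothetical placeholder: replace with the real IO test.
    sum(range(100_000))


if __name__ == "__main__":
    print(summarize(time_runs(example_workload, runs=10)))
```

Reporting the number of runs next to the mean and standard deviation is what lets a reader judge whether a difference between two systems is real or just noise.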
The paper examined the 106 benchmark papers for the number of times the benchmark was run. The table below is from the review paper for all 388 benchmarks examined and is broken down by conference. Since most of the time the data was unclear, it was assumed that each benchmark was run only once.
Table 1 – Statistics of Number of Runs by Conference
It is fairly obvious that the dispersion in the data is quite large. In some cases the standard deviation is as large as or larger than the mean value.
Runtime The next topic examined is the runtime of the benchmark. Of the 388 benchmarks examined, only 198 (51%) specified the elapsed time of the benchmark. From this data, it was found:
- 28.6% of the benchmarks ran for less than one minute
- 58.3% ran for less than 5 minutes
- 70.9% ran for less than 10 minutes
Typically, runs that short (less than one minute) do not last long enough to reach any sort of steady-state value.
With 49% of the benchmarks having no known runtime and another 28.6% running for less than a minute, easily three-quarters of these results should cause some of your warning bells to start ringing. If there’s no data, it’s not a benchmark; it’s an advertisement.
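The arithmetic behind that estimate depends on how the 28.6% figure is read, and the summary above does not say whether it is a share of all 388 benchmarks or only of the 198 that reported a runtime. The short sketch below works out both readings; the variable names are ours and the two readings are assumptions, not statements from the paper.

```python
TOTAL = 388            # benchmarks examined, as quoted above
WITH_RUNTIME = 198     # benchmarks that reported an elapsed time
UNDER_ONE_MIN = 0.286  # share that ran for less than one minute

# Share of benchmarks with no reported runtime at all.
no_runtime = (TOTAL - WITH_RUNTIME) / TOTAL

# Reading A (assumption): 28.6% is a share of all 388 benchmarks.
share_a = no_runtime + UNDER_ONE_MIN

# Reading B (assumption): 28.6% is a share of only the 198 with a runtime.
share_b = no_runtime + UNDER_ONE_MIN * WITH_RUNTIME / TOTAL

print(f"no runtime reported: {no_runtime:.1%}")  # 49.0%
print(f"reading A: {share_a:.1%}")               # 77.6%
print(f"reading B: {share_b:.1%}")               # 63.6%
```

Under reading A the suspect share is a little over three-quarters; under reading B it is closer to two-thirds. Either way, well over half of the results deserve a skeptical look.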
Variety of Benchmarks The third topic examined was the number of benchmarks run in the papers. It was found that 37.7% of the papers used only one or two benchmarks. This makes it very difficult to understand the true performance of the system because a single benchmark presents only one aspect of the system.
After performing the qualitative examination of the papers and benchmarks, the authors proceeded to examine many of the common benchmarks. They divided the group into several pieces:
- Macro-Benchmarks with the following examples:
- Compile benchmarks (e.g. compiling the kernel)
- The Andrew File System Benchmark
- TPC (Transaction Processing Performance Council)
- SPEC (SFS, SDM, Viewperf, Web99)
- SPC (Storage Performance Council)
- Netbench and dbench (not used very often)
- Replaying Traces
- Micro-Benchmarks with the following examples:
- Bonnie and Bonnie++
- Sprite LFS
- Ad-Hoc Micro-Benchmarks
- System Utilities (e.g. “wc”, “cp”, “diff”, “tar”)
- Configurable Workload Generators
Popular Benchmarks != Correct Benchmarks
The researchers then decided to take the two most popular benchmarks, Postmark and Compile, and do some more quantitative analysis to examine how they functioned and what kind of information they could provide. To do this they took the ext2 file system and modified it to slow down certain operations (they called it SlowFS). They slowed down reads (reading data from disk), prepare write and commit write, and lookup. The amount of slowdown could be varied through mount options. Please note that this type of slowdown exercises the CPU, not the actual IO.
For the compile benchmark they focused on compiling OpenSSH. The compile is predominantly driven by read operations, so they slowed down the read operations by a factor of 32. They found that even at this extreme factor, the execution time for the compile only increased by 4.5%. I think this shows that compiling something is perhaps not the best benchmark, since variations in the storage or file system will barely be noticed in the elapsed time. This benchmark is dominated by the CPU time needed to do the actual compilation and not necessarily by IO.
For the Postmark benchmark, they slowed down the previously mentioned operations by a factor of 4 (separately and together) for 3 different “configurations”, or sets of Postmark parameters. The researchers found that the three configurations produced widely varying run times, from 2 seconds to 214 seconds (the 2 second run barely produced any IO). The other observation they made was that some sets of Postmark parameters showed more of the SlowFS effects than others.
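A quick back-of-the-envelope calculation shows just how little of the compile is spent in the slowed operations. The sketch below assumes a simple linear model that is ours, not the paper’s: if a fraction f of the original runtime is spent in reads and reads become k times slower, the new runtime is (1 - f) + f * k times the original, so the increase is f * (k - 1). The function names are hypothetical.

```python
def read_fraction(slowdown_factor, runtime_increase):
    """Fraction of the original runtime spent in the slowed operation.

    Assumed model: increase = f * (slowdown_factor - 1).
    """
    return runtime_increase / (slowdown_factor - 1)


def predicted_increase(fraction, slowdown_factor):
    """Relative runtime increase for a given fraction and slowdown factor."""
    return fraction * (slowdown_factor - 1)


# Numbers from the text: reads slowed 32x, elapsed time up by 4.5%.
f = read_fraction(32, 0.045)
print(f"time spent in reads: {f:.3%}")  # 0.145%

# What a 2x slower read path would do to the same compile under this model.
print(f"increase at 2x: {predicted_increase(f, 2):.3%}")  # 0.145%
```

Under this model, reads account for roughly 0.15% of the compile time, so a file system whose reads were twice as slow would change the elapsed time by about the same tiny amount, which is well inside normal run-to-run variation.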
Eating Their Own Dogfood
Finally, the authors made some observations and conclusions from their work.
- They recommend that, with the current set of available benchmarks, at least one macro-benchmark or actual application trace be tested along with several micro-benchmarks. In essence, use both macro-benchmarks and micro-benchmarks, and use several of them, to better gauge performance, including the areas where the system performs well and the areas where it doesn’t.
- Benchmarks should definitely improve the descriptions of what was done as well as why it was done (this second point was emphasized in the paper).
- Furthermore, the authors offer the opinion, with good reason, that there should be some analysis of the system’s expected behavior as well as various benchmarks that either prove or disprove the hypothesis (this goes to the “why” of the benchmark). This goes well beyond the simple graph or table that is so typically shown.
- The current state of performance evaluations has a great deal of room for improvement.
- They state that the standards clearly need to be raised.
- They also state that there needs to be better dissemination
- There need to be better and standardized benchmarks for file systems and storage testing
- Finally, the authors question the usefulness of standardized industrial benchmarks since they are usually used to report a single number, not to help characterize or benchmark a complex system (e.g. think of the usefulness of the TPC and SPC numbers you see: do they present any useful information to you?)
Summary and Observations
The authors of the paper took a wide range of research-oriented benchmarks from reputable conferences and performed a qualitative analysis of them. The results are both extremely interesting and somewhat depressing. From a higher perspective they found:
- Much of the time, the benchmarks are run only once and in some cases the testing time is so short that the results may be of little use.
- There is little or no explanation as to why a benchmark was run
- There is little or no information about the run so that it could be repeated by someone else
- Some of the benchmarks may not be useful in helping to characterize or benchmark a storage system
In short: You’re doing it wrong.
The paper discussed some recommendations about ways to improve benchmarking which everyone should take to heart. In particular, benchmarks should be run multiple times and be presented with some sort of dispersion measure (e.g. standard deviation). Perhaps more importantly, when the benchmark results are published there should be a discussion of what the benchmarks are meant to show as well as why certain benchmarks were chosen. It is hoped that all future benchmarks, whether you, the reader, run them or read benchmarks run by someone else, will include this information in the results.
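One way to make that habit stick is to record the “what” and the “why” next to the numbers every time a benchmark is run. The following Python sketch is only an illustration of the idea: the `BenchmarkReport` class and its field names are hypothetical, not something the paper prescribes, and any values you put in it should come from your own runs.

```python
import json
import platform
import statistics
from dataclasses import asdict, dataclass, field


def _system_info():
    # Basic facts about the test machine; extend with disk, file system, etc.
    return {
        "os": platform.system(),
        "release": platform.release(),
        "machine": platform.machine(),
    }


@dataclass
class BenchmarkReport:
    name: str            # which benchmark was run
    purpose: str         # what the benchmark is meant to show
    rationale: str       # why this benchmark and configuration were chosen
    configuration: dict  # everything needed to repeat the run
    runtimes_s: list     # elapsed seconds of every run, not just the best one
    system: dict = field(default_factory=_system_info)

    def to_json(self):
        """Serialize the report together with run count, mean, and spread."""
        record = asdict(self)
        record["runs"] = len(self.runtimes_s)
        record["mean_s"] = statistics.mean(self.runtimes_s)
        record["stdev_s"] = (
            statistics.stdev(self.runtimes_s) if len(self.runtimes_s) > 1 else 0.0
        )
        return json.dumps(record, indent=2)
```

Publishing a record like this alongside the graph gives a reader what they need to repeat the run and to judge whether the benchmark answers the question it claims to answer.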