Ready for the HPC MMA battle? Of course, I mean a Memory and Messages Assessment.
If you have read any of my past columns, you will have noticed that I like to test assumptions and try obvious things. I often find that things do not work the way people expect. As more and more cores show up in processors, one of my burning questions is: “How does the performance of an MPI program on an SMP node compare to a similar OpenMP program on the same node?”
The question is important. Nodes may have 24 or even 48 cores in the near future. Most codes use fewer than 32 cores, so why bother with the cluster? (That is a subject for another column.) Some may assume that a threaded OpenMP approach is a slam-dunk in a shared memory (SMP) environment. Of course, I like to test this idea because rewriting code is not the best use of our time.
In a previous column, I had an opportunity to test OpenMP and MPI on a brand new 8-core SMP machine. My interest was to see how well MPI codes worked on an SMP platform. The actual machine was an 8-way Intel server that used two Clovertown processors (four cores per socket). I also understood that many things other than the programming language (i.e., threads or messages) could affect the result, but I just wanted to get a feel for what would happen. The results were rather interesting (see the column), and there was no clear winner. The assumption that OpenMP should “blow away” MPI on an SMP machine did not hold up.
Recently, I had access to a new 12-core Intel Xeon machine (dual X5670 processors at 2.93GHz) with 48 GB of DDR3 memory and the Intel 5520 chipset. It came preloaded with Red Hat 5 and was running kernel 2.6.18-128.el5. For programming software, I used Red Hat gcc/gfortran 4.1.2 (with OpenMP support) and Open MPI version 1.2.7. These are the “stock” versions that came with the install. I decided it was time to get another data point (or points) for my MPI vs. OpenMP tests. This time I had a different CPU, memory architecture, MPI, and compiler (although I am not sure how different the compiler is).
As I did previously, I used the NAS benchmark suite (version 3.2). You can find a description of the tests on the website. The NAS suite has the same programs written in both MPI and OpenMP so an “apples to apples” comparison is possible, although it should not be taken as an exact comparison as it is always possible to optimize for a given language.
The results of the NAS suite are reported in MOP/Second (Million OPerations per Second). The higher the better. With the exception of IS (integer sort), the results are really floating point operations per second and represent performance on various math kernels used in aerodynamics. Each test was run three times and the average is reported.
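Since each score is an average of three runs, and the comparison comes down to the percent difference between the MPI and OpenMP averages, the arithmetic can be sketched in a few lines of Python. This is only an illustration: the function name `summarize` is hypothetical, the run values are made-up placeholders and not numbers from my tests, and measuring the lead relative to the slower score is an assumption about how a percent difference might be defined.

```python
from statistics import mean


def summarize(mpi_runs, openmp_runs):
    """Average repeated runs and report the winner and its lead in percent.

    Assumption: the lead is measured relative to the slower average.
    Scores are in million operations per second, so higher is better.
    """
    mpi_avg = mean(mpi_runs)
    omp_avg = mean(openmp_runs)
    winner = "MPI" if mpi_avg >= omp_avg else "OpenMP"
    faster = max(mpi_avg, omp_avg)
    slower = min(mpi_avg, omp_avg)
    lead_pct = 100.0 * (faster - slower) / slower
    return winner, mpi_avg, omp_avg, lead_pct


if __name__ == "__main__":
    # Placeholder values for illustration only, not measured results.
    hypothetical_mpi = [1200.0, 1210.0, 1190.0]
    hypothetical_openmp = [1000.0, 990.0, 1010.0]
    winner, mpi_avg, omp_avg, lead = summarize(hypothetical_mpi, hypothetical_openmp)
    print(f"MPI avg: {mpi_avg:.1f}  OpenMP avg: {omp_avg:.1f}")
    print(f"Winner: {winner} by {lead:.1f}%")
```

A different baseline (for example, always dividing by the OpenMP score) would give different percentages for the same runs, so it is worth stating which one you use when you report your own results.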
The results are in Tables One and Two below. The winner is in bold and the percent difference between the two scores is given. I first ran the tests at the B level (the level determines the problem size). I used eight cores because most of the NAS MPI tests work best with a power of two for the number of processes. (Some tests require a square number of processes, such as 4 or 16, and were not run.) There were actually 12 cores available, but trying to use all of them would further reduce the number of possible MPI benchmarks.
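To see why eight processes is the practical choice on a 12-core node, here is a small Python sketch that lists the process counts satisfying each constraint. The helper names are hypothetical, and the two rules (power of two, or power of two that is also a perfect square) are my simplified reading of the constraints described above, not something taken from the NAS documentation.

```python
import math


def is_power_of_two(n):
    """True for 1, 2, 4, 8, ..."""
    return n > 0 and (n & (n - 1)) == 0


def is_perfect_square(n):
    """True for 1, 4, 9, 16, ..."""
    root = math.isqrt(n)
    return root * root == n


def usable_counts(max_cores, need_square=False):
    """Process counts up to max_cores that are powers of two.

    With need_square=True the count must also be a perfect square
    (an assumed, simplified version of the stricter requirement).
    """
    counts = [n for n in range(1, max_cores + 1) if is_power_of_two(n)]
    if need_square:
        counts = [n for n in counts if is_perfect_square(n)]
    return counts


if __name__ == "__main__":
    cores = 12
    print("power of two:", usable_counts(cores))
    print("power of two and square:", usable_counts(cores, need_square=True))
```

Under these assumed rules, the largest power-of-two count that fits in 12 cores is eight, while the tests with the stricter requirement top out at four, which is why they drop out of an eight-process comparison.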
Table One: MPI and OpenMP results for NAS suite B level tests on eight cores. (Million Operations per second, higher is better)
I then ran the C level (bigger problem size) and found that two tests did not run. In this case, I had four remaining tests with good data. As the problem size got bigger, there was no change in the leaders, but some of the differences changed quite a bit.
Table Two: MPI and OpenMP results for NAS suite C level tests on eight cores. (Million Operations per second, higher is better)
Comparing these with my previous results, we see an interesting flip. First, CG went from being 7% faster with OpenMP (previous results) to a hefty 32% faster using MPI (current results). FT still works best using OpenMP, but the gap is now much smaller. Similarly, IS is still way ahead using MPI, but the gap is narrowing, while LU and EP are about the same in terms of differences. Finally, the OpenMP version of MG is working much better and has gained quite a bit of ground on the MPI version. Also note the overall improved performance over the Clovertown results.
As in my previous column, I conclude with “it all depends.” There are many variations in terms of compilers, processors, and memory architecture, not to mention your code. The golden rule of HPC, “test your codes,” certainly still holds because many assumptions do not.
I also wanted to mention that the need to improve MPI performance on SMP nodes has been recognized by both the Open MPI and MPICH2 teams. Each now employs KNEM, a Linux kernel module that enables high-performance intra-node MPI communication for large messages (i.e., it improves large message performance within a single multi-core node). This can only mean one thing: more testing.