Good Old Ethernet

Did you know there are two projects that can give Ethernet a performance boost?

Every once in a while there are projects that just seem to hit the nail right on the head. Two such projects are GAMMA and Open-MX. (Note: the dash is important, as there is an unrelated HPC application called OpenMX, the Material Explorer for nano-scale material simulations.) Before I tell you why I’m excited about these projects, let me provide a little background.

Although high performance interconnects are an important part of HPC clustering, a large number of clusters use standard Gigabit Ethernet (GigE). Indeed, 56% of the Fall 2008 Top500 systems used GigE. This result is what you might expect, as almost all motherboards have at least one (if not two) GigE ports and large GigE switches are quite reasonably priced. For many applications GigE just works. In terms of the OS it is plug and play, although there is some tuning you can do with today’s GigE chipsets. In any case, when GigE is used on a cluster, virtually all communication goes through the kernel using either UDP or TCP services. While many HPC mavens consider kernel services to be slow due to kernel overhead and data copying, they are still very reliable and contribute to the plug-and-play nature of a GigE cluster. In addition, virtually all MPIs support a TCP/IP transport layer. Ethernet “just works.”

As mentioned, using TCP/IP as a transport adds overhead to the messages sent between cluster nodes. The overhead increases latency and can lower the throughput of the interconnect. While many applications can live with TCP/IP performance, multi-core, among other things, has strained the performance of a standard GigE connection by putting more pressure on the interface. Instead of two single-core processors sharing an interface, 8 or more cores may now need to share a single GigE interface.
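To make "latency through the kernel" a bit more concrete, here is a minimal sketch of the classic ping-pong test run over an ordinary TCP socket. It is not taken from either project; the function names (echo_server, recv_exact, pingpong_latency_us) are my own, and it runs over the loopback interface, so the number it prints reflects only the software path through the kernel and not a real GigE link.

```python
import socket
import threading
import time


def echo_server(listener):
    """Accept one connection and echo every byte back until the peer closes."""
    conn, _ = listener.accept()
    with conn:
        # Disable Nagle's algorithm so small replies are sent immediately.
        conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        while True:
            data = conn.recv(4096)
            if not data:
                break
            conn.sendall(data)


def recv_exact(sock, nbytes):
    """Read exactly nbytes from sock (TCP may deliver a message in pieces)."""
    chunks = []
    remaining = nbytes
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("peer closed the connection early")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def pingpong_latency_us(msg_size=1, rounds=1000):
    """Estimate one-way latency in microseconds as half the mean round trip."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    server = threading.Thread(target=echo_server, args=(listener,), daemon=True)
    server.start()

    client = socket.create_connection(("127.0.0.1", port))
    client.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    payload = b"x" * msg_size

    start = time.perf_counter()
    for _ in range(rounds):
        client.sendall(payload)        # ping
        recv_exact(client, msg_size)   # pong
    elapsed = time.perf_counter() - start

    client.close()
    server.join(timeout=5)
    listener.close()
    return elapsed / rounds / 2 * 1e6


if __name__ == "__main__":
    print(f"one-way latency: {pingpong_latency_us():.1f} us")
```

Python adds its own overhead on top of the kernel's, so treat the result as a rough illustration of the method. MPI-level benchmarks measure the same round trip, just with far less software in the way.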

One solution to the GigE/TCP/IP bottleneck is to get a better interconnect (e.g., Myrinet or InfiniBand). If your budget allows and your application merits such an expense, then this is a good solution. If, on the other hand, you cannot afford a better interconnect, then there has been little you could do to help GigE performance (as mentioned, there is some tuning that may help). GigE users have always envied the high-end networks because they “by-pass” the kernel and allow user processes to write directly to each other’s memory. There is no TCP/IP copying or protocol overhead; the data just goes from one node to the other as fast as possible. Such methods require a version of MPI that supports their transport API, but all modern MPI versions support the popular by-pass libraries.

Creating a kernel by-pass Ethernet certainly makes sense in the HPC world. There is the issue of getting all the chipset vendors to include some code in their drivers, and a kernel patch might be needed as well. In the open source world, however, users are free to create their own solutions. Such is the case with the GAMMA (The Genoa Active Message MAchine) project at DISI (University of Genova, Italy). The project is run by Giuseppe Ciaccio and has been producing some very interesting numbers for many years. Due to the need to maintain changes in the drivers, GAMMA now supports a small subset of Ethernet chipsets from Intel and Broadcom. For example, with the appropriate Intel NIC you can expect latency on the order of 6.1 μs (10.8 μs with a switch) and a throughput of 123.4 MBytes/sec. The N/2 value (the message size at which half of the peak throughput is reached) is 1820 bytes (7600 bytes with a switch). Personally, I find these numbers amazing considering an untuned GigE NIC will produce latencies in the 60 μs range. There is also a version of MPICH (MPI/GAMMA) that can be used to build and run codes. The performance of GAMMA does come with some restrictions. In addition to supporting only a small subset of NICs, GAMMA requires a specific kernel version, and the interface running GAMMA does not allow LAN traffic. It is therefore necessary to have a second network for standard traffic (NFS, ssh, rsh, etc.). GAMMA is more than a research project, as it is used by the OpenFOAM package (an open CFD package). Consult the GAMMA page for more information.
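To see why latency matters so much for small messages, it helps to plug the numbers above into the simplest possible model: transfer time equals a fixed latency plus message size divided by peak bandwidth. The sketch below is my own back-of-the-envelope illustration, not something from the GAMMA project, and the function names are made up for the example. It also assumes, purely for comparison, that an untuned NIC with 60 μs latency reaches the same 123.4 MBytes/sec peak.

```python
def transfer_time_us(nbytes, latency_us, bandwidth_mbytes_s):
    """Time to move nbytes under a simple latency + size/bandwidth model.

    With 1 MByte = 10**6 bytes, 1 MByte/sec equals 1 byte per microsecond,
    so nbytes / bandwidth_mbytes_s is already in microseconds.
    """
    return latency_us + nbytes / bandwidth_mbytes_s


def effective_throughput(nbytes, latency_us, bandwidth_mbytes_s):
    """Achieved throughput in MBytes/sec for a single message of nbytes."""
    return nbytes / transfer_time_us(nbytes, latency_us, bandwidth_mbytes_s)


def model_n_half(latency_us, bandwidth_mbytes_s):
    """Message size (bytes) at which this model reaches half of peak bandwidth."""
    return latency_us * bandwidth_mbytes_s


if __name__ == "__main__":
    peak = 123.4  # MBytes/sec, the GAMMA figure quoted above
    # Latencies in microseconds; the untuned case reuses the same peak
    # bandwidth as an assumption for illustration only.
    for name, latency in (("GAMMA", 6.1), ("untuned GigE", 60.0)):
        rate = effective_throughput(1024, latency, peak)
        n_half = model_n_half(latency, peak)
        print(f"{name}: 1 KB message at {rate:.1f} MBytes/sec, "
              f"model N/2 = {n_half:.0f} bytes")
```

Under this model the N/2 for GAMMA works out to roughly 753 bytes, noticeably smaller than the measured 1820 bytes. Real hardware does not follow a two-parameter model exactly, so take the model as a way to reason about trends and not as a substitute for measurement.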

In addition to GAMMA, I have been watching the Open-MX project. Compared to GAMMA, Open-MX is relatively new. Open-MX is a kernel by-pass message layer that brings the Myrinet Express (MX) protocol to Ethernet. Unlike GAMMA, it requires no kernel or driver modifications and will work on all Ethernet NICs. Performance will vary by NIC of course, but in my own tests using Intel 82572EI PCIe NICs I have been getting 20 μs latencies (with a switch) and a throughput of 117.5 MBytes/sec. That is not quite as good as GAMMA, but Open-MX is much more flexible in terms of use. There is an added bonus to Open-MX as well. Any software that supports the Myricom MX libraries (e.g., various MPIs, PVFS2) can link directly to Open-MX. It is fairly simple to build any of the open MPIs with the Open-MX libraries.
If you would like to learn more, consult the slides and a video of a recent presentation by project leader Brice Goglin. Well worth a look.

With GAMMA and Open-MX, good old Ethernet just gets better. While there are better interconnects for HPC, it is always nice to see someone tweak commodity hardware and have the freedom to innovate. Indeed, these two projects are why an open approach to HPC works so well. Without access to the open source plumbing, the HPC community would have little option but to use what was given to them. In that case, I suppose I should give a nod to good old GNU/Linux as well. Thanks guys.
