HPC without coding in MPI is possible, but only if your problem fits into one of several high level frameworks.
Not all HPC applications are created in the same way. There are applications like Gromacs, Amber, Fluent, etc. that allow domain specialist to input their problem into an HPC framework. Although there is some work required to “get the problem into the application”, these are really application specific solutions that do not require the end user to write a program. At the other end of the spectrum are the user written applications. The starting points for these problems include a compiler (C/C++ or Fortran), an MPI library, and other programming tools. The work involved can range form small to large as the user must concern themselves with the “parallel aspects of the problem”. Note: all application software started out at this point some time in the past.
There are also programming languages that attempt to remove the “parallel execution aspects” from the programmers responsibility. These languages are used in a variety of contexts. For example, Linda, ZPL, Cuda, OpenCL, Ct, and others attempt to remove much of responsibility of the parallel execution form the end user. While these languages are valuable, they may represent a challenge for some users and may not be as portable as the more explicit “C/Fortran and MPI solution.”
Just above these implied parallel languages, there is a middle ground that I like to call “Sledgehammer HPC.” These methods are not so much finished applications as they are a framework for representing your program at an abstraction level that can be easily executed in parallel. This type of approach allows you to focus on your problem at hand and not think about the number of cores or nodes. Some of these methods tend to be a bit domain/problem specific and may not fit for all problems, or it may be difficult to cast a problem in the framework, but if they do work, they allow one to throw the full weight of a cluster at your problem with minimal programming effort.
There are many possible frameworks in this area. In this article, we will cover three that I find most interesting — Genetic Algorithms, Cellular Automata, and Map-Reduce. One could consider graphics rendering in this category, but ever since the Titanic movie, the Linux Cluster CGI story has been told many times over.
Genetic Algorithms (GAs) have been around for a while and have been used in bioinformatics, phylogenetics, computational science, engineering, economics, chemistry, manufacturing, mathematics, physics and other fields. The idea is based on evolutionary biology where a population of possible solutions are tested against a fitness function. The best results are saved and the wost are removed from the population. The populations is restored by creating new solutions from old ones using methods that mimic genetic biology — cross-over and mutation. The fitness test is re-run on the new population and the process repeats. The algorithm terminates when either a maximum number of generations has been produced or a satisfactory fitness level has been reached for the current population. If the algorithm has terminated due to a maximum number of generations, a satisfactory solution may or may not have been reached.
A few things are worth noting. GAs are good at optimizing, however, with any optimization problem, landing on a local maximum is always a problem. The robustness and effectiveness are determined by the genetic methods used to re-seed the population, the population size, and the run time. In addition, solutions are typically represented as binary strings that can be easily manipulated. For instance, a random mutation may be introduced in one individual (candidate solution) by flipping a bit or bits. Likewise, cross-over may take some of portion of one individual and swap it with that from another. While such a low level “twiddling” as a way to solve a high level problem seem inefficient and not even possible, GAs do work. They often require large amounts of computer time, particularly if the fitness function takes a while to run. Of course, in every iteration there will be many combinations that don’t work or work poorly as solutions. These apparent “wasted” compute cycles are what give GAs their power as there is no “knowledge” as to what the best combination might be. A slight modification of a poorly performing individual may push it to the top (away from a local maxima). The need for many compute cycles is why GAs are great candidates for parallel HPC.
To get an idea of how a GA works, consider the follow pseudo code representation of a GA (taken from Wikipedia).
- Choose initial population
- Evaluate the fitness of each individual in the population
- Repeat until termination: (time limit or sufficient fitness achieved)
- Select best-ranking individuals to reproduce
- Breed new generation through crossover and/or mutation (genetic operations) and give birth to offspring
- Evaluate the individual fitness of the offspring
- Replace worst ranked part of population with offspring
There are several areas for parallelization in a GA. First, each individual is independent until it is time to cull and re-seed the population. This is a potential for parallel execution, but there may be a better way if the fitness function requries a lot of compute cycles. If the population is large, and step 3 above requires a non-trivial amount of CPU cycles, then it is quite easy parallelize the fitness function tests. In many cases, the fitness function represents a reduction of data to a single number that can be used for evaluating solutions. Therefore, very little information needs to be transmitted once the fitness function is computed.
As you can infer, the user is required to create the fitness function and design the data representation to be used as the individuals in the population. While this may take some thought, once the problem is caste in “GA terms” you can unleash the full power of a cluster on your GA. Need a more power, no need to touch the code, just add some nodes.
Before we leave GA land, let’s consider a simple example. Most compiler have more options than any human can manage. To test various options and option combinations you need to try each combination and see which one produces the best run time. Tedious to say the least. One could “script up” such an experiment, but why not try a GA. The fitness function would be compiling, running the program, and recording the time. The options (individuals) representing the highest times are removed, new individuals are created, and fitness tests run again, repeating the cycle. While I am glossing over details, you get the general idea. If you want to get really fancy, throw in MPI options (Open MPI is good candidate!). In this case you may need several nodes to calculate your fitness function.
If you want to play with parallel GAs you can check out PGAPack (Parallel Genetic Algorithm Package) a general-purpose, data-structure-neutral, parallel genetic algorithm library developed at Argonne National Laboratory. There are others of course (MPIKAIA). I’m not sure which is the best one. Maybe, I’ll write a GA to find out. More information is available at genetic-programming.org.v