Not being one to gloat (cough), I would like to take this opportunity to say, “I told you so.” In case you are wondering what it was exactly I said, I will recap.
I periodically bemoan the state of parallel programming. My experience with parallel programming stretches back to the late 1980′s. At this time, I was working on various parallel programing ideas including Fortran conversion and Logic Programming. The lessons I learned were two fold. First, parallel computing is hard and second we need to re-think how we do things. This opinion is shared by virtually everyone who has worked in HPC over the years. The prognosis is not that great either. Take a look at Greg Pfister’s blog where he writes about comments like Parallelism Needs a Killer Application. By the way, Greg’s other blog posts are worth reading.
In a similar vein, Mike Wolfe of The Portland Group (producers of excellent optimizing compilers) writes in HPCwire about the recent OpenCL fanfare. His article entitled Compilers and More: OpenCL Promises and Potential provides a fair assessment of OpenCL. I like to think of these articles as “buckets of cold water thrown on the masses” who have been whipped up by various media types. Every time I hear the words “parallel programming”, “solution”, “simple”, and “breakthrough” used in a sentence, I usually know it is some kind of sales pitch. In the case of OpenCL, I have read too many times that OpenCL is a simple unifying way to program parallel GPU’s and host processors. Of course, things like CUDA and OpenCL are welcome additions to any GPU programmers tool box. They help quite a bit and are important technologies. Do they solve the general parallel programming challenge? Not really and neither does anything else.
Before I move to the second issue, I want to state that my definition of the “parallel programming challenge” is to allow a programmer to write one program and have it run with reasonable efficiency on multi-core, GP-GPU, or clusters. A tall challenge if there ever was one. I don’t think it is impossible, but it might take re-thinking how we currently do things.
I believe part of the rethinking we need to do is about problem expression and execution. For instance, I have been waving the Functional Programming Flag for a while. In addition, I believe dynamic execution can solve a lot of problems. That is, the program will need to decide how to do things at run-time and not compile-time. The basic idea is to use a functional approach to naturally express parallel parts of a program and create a queue of parallel work. When the program runs, the queue is executed depending on the type of resources available. If there is one core, then the queue is resolved one task at a time, if there are many cores, the queue is solved in parallel. I have explored this idea in a series of three articles I wrote several years ago. ( You Can’t Always Get What You Want, The Ignorance is Bliss Approach, and Explicit Implications of Cluster Computing).
Of course, I don’t propose that I am the only one considering this approach to a parallel computing. Indeed, I have always assumed there are those more adept than me working on variations of this idea. Recently, I was delighted to read that Apple has open-sourced a project called Grand Central Dispatch (GCD). Essentially, Apple believes GCD is a better way to manage multi-core parallelism than using threads and has included it in the latest version of Mac OS X (10.6, Snow Leopard).
What is GCD exactly? As always, a little background is needed. GCD is based on work queues and “C blocks”. Most people get the work queue idea, but “C blocks” are a somewhat new concept for most programmers. C blocks are not a standard feature of the language, but look like function definitions with one big difference — they can capture the state from their surrounding context and save it (read-only) within the block for later execution. Like C function definitions, they can take arguments, and declare their own variables internally.
When a block expression is executed both a reference to the code within the block and a snapshot of the current state of local stack variables at the time of its invocation is created. The code is not executed, but a type-safe opaque value is created that can be assigned to variables, passed to functions, and otherwise treated like a normal language value. One of the way to use blocks is to place parallel blocks in a queue.
GCD implements two types of queues, Dispatch and Operation. The dispatch queues are a C-based mechanism for executing custom tasks. A dispatch queue executes tasks either serially or concurrently but always in a First-in-First-out order. A serial dispatch queue runs only one task at a time, waiting until that task is complete before dequeuing and starting a new one. By contrast, a concurrent dispatch queue starts as many tasks as it can without waiting for already started tasks to finish.
Operation queues are more complex than the First-in-First-out dispatch queues and take other factors into account when determining the execution order. Primary among these factors is whether a given task depends on the completion of other tasks. You can configure dependencies when defining your tasks and use them to create complex execution-order graphs for your tasks in operation queues.
If you are a Mac user, you can play with GCD using libdispatch. The libdispatch project consists of a user space implementation of the Grand Central Dispatch API. In order to implement the full API for Grand Central Dispatch, C compiler support for blocks is required. Contributions to the project will be continually evaluated for possible inclusion in future releases of Mac OS X. The sources are available under the terms of the Apache License, Version 2.0 in the hope that they might serve as a launching point for porting GCD to other platforms.
If GCC gets block support then libdispatch will probably be available to the Linux crowd as well. Apple’s motivation for opening-up GCD is probably to gain more support for what has become a key, but non-standard technology in their kernel.
Here is my take on these developments. Aside from the fact that C is not a functional language, but blocks do retain state, recasting a program using a “work queue” model offers a huge advantage of traditional methods. Executing the queue can be decoupled from the program. Thus, a program can, in theory, run on one core, or one thousand, without having to change any code. Of course, it will run slower on one core, but it will still run. Or, consider a typical compute node with eight cores. Suppose a dynamic program is using all eight cores with a large number of jobs still remaining in the queue. Some simple algorithms can be developed that allow remote cores (other nodes) or even heterogeneous cores (different types of nodes) to be used by the program.
Finally, I’m going to predict that this approach will gain in popularity if C blocks become available in GCC and other compilers. Unfortunately, it will not be due to anything I have written (or will write). Nor will it be due to the queue model, dynamic execution, multi-core, or the open availability of GCD. The biggest driver will be a small, but powerful, addition to the C language called C blocks.