Silence the struggle around cluster software stack configuration. Caos NSA is a distribution that focuses on making things simple, easy to install and upgrade, and easy to manage.
- Select network for cluster (eth2)
- Define the IP range for nodes (10.1.1.2 to 10.1.1.252)
- What number should the first node start with (1)
- Perceus registration
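To get a feel for how many compute nodes that IP range leaves room for, here is a tiny shell sketch. The helper name `range_size` is mine, and it only does arithmetic on the last octet of the two addresses above; it says nothing about how Perceus itself hands out addresses.

```shell
#!/bin/sh
# Hypothetical helper: count the addresses between two last octets, inclusive.
range_size() {
    echo $(( $2 - $1 + 1 ))
}

# 10.1.1.2 through 10.1.1.252, as entered during the install
echo "$(range_size 2 252) addresses available for compute nodes"
```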
After step 2, the installation configures Perceus. It even configures and generates ssh keys (what a nice thing to do). Perceus registration is also optional, but if you don’t input something, you can’t continue (actually this is a feature and I’ll explain why in a bit).
After Perceus was installed, Sidekick asked if I wanted to check for updates. It found a few updates, all of them for the desktop. It asks if you want to install the updates or not (looks like yum to me) and then does a little housekeeping. During this housekeeping phase it erases the installation packages and does a few other things including setting up ntp for the cluster (if you run a cluster, you need to run ntp).
At this point, the master node was ready to go. I needed to grab a VNFS capsule, so I used wget to pull down a premade capsule for Caos NSA:
# mkdir CAPSULES
# cd CAPSULES
# wget -v -c http://mirror.caoslinux.org/Caos NSA-1.0/vnfs \
Once the capsule is downloaded you need to “import” it, so Perceus knows about it:
# perceus import vnfs \
During the import, Perceus asks a few quick questions. For example, it asked which root password I wanted to use for the capsule, which Ethernet device to use for booting, and the address of the machine holding the capsules. After a few minutes, I was able to check that the capsule had been imported:
# perceus vnfs list
At this point Perceus knows about the capsule I wanted to use.
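If you want to script that check rather than eyeball it, something like the following could work. This is only a sketch: it assumes that `perceus vnfs list` prints one capsule name per line (I have not verified the exact output format), and both the function name `vnfs_is_listed` and the capsule names are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read a capsule listing on stdin and succeed only if
# the given name appears as a complete line (exact, fixed-string match).
vnfs_is_listed() {
    grep -qxF -- "$1"
}

# Demo with made-up listing data; on a real master node you would pipe in
# the output of `perceus vnfs list` instead of printf.
printf 'capsule-a\ncapsule-b\n' | vnfs_is_listed capsule-b && echo "capsule-b is imported"
```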
Everything is ready to go and we can start booting compute nodes. I then booted my first compute node with a monitor and keyboard plugged into it. I saw the node grab the Perceus OS via DHCP. It then said that no VNFS image had been defined for node n00001. This makes perfect sense since Perceus didn’t know anything about this compute node at this point, so it didn’t know which capsule I wanted it to use. I then told Perceus I wanted to use a specific capsule:
# perceus node set vnfs \
This command tells Perceus to use the particular VNFS image on all nodes n00000, n00001, …, n00009. At this point, Perceus told n00001 about its VNFS image and sent it to the node. The next thing I know, the node is up and running!
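The node names follow a simple pattern, at least judging from the names in this article: the letter n followed by a five-digit, zero-padded number. A small shell sketch can generate such a list for use in your own scripts. The function name `node_names` is hypothetical, and this is not how Perceus expands node ranges internally.

```shell
#!/bin/sh
# Hypothetical helper: print node names from index $1 to index $2, inclusive,
# using the n + five-digit pattern seen in this article (n00000, n00001, ...).
node_names() {
    i=$1
    while [ "$i" -le "$2" ]; do
        printf 'n%05d\n' "$i"
        i=$((i + 1))
    done
}

# The ten nodes covered by the command above
node_names 0 9
```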
I could easily check that the compute node was up by,
# ssh n00001
as root. If it succeeded the node was up and ready. By default, Caos NSA configures Perceus to NFS export /var/lib/perceus/ from the master node and mounts them on the compute nodes.
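Once there is more than one compute node, the ssh check can be wrapped in a small loop. The sketch below assumes root can log in with the ssh keys the installer generated; the function names are mine. `BatchMode=yes` keeps ssh from stopping to ask for a password, and `ConnectTimeout=5` gives up on an unreachable node after five seconds.

```shell
#!/bin/sh
# Hypothetical helper: succeed if we can log in to the node as root and run
# a trivial command without any interactive prompt.
node_is_up() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 "root@$1" true >/dev/null 2>&1
}

# Hypothetical helper: print one status line per node name given.
report_nodes() {
    for node in "$@"; do
        if node_is_up "$node"; then
            echo "$node: up"
        else
            echo "$node: no response"
        fi
    done
}

# Usage on the master node:
#   report_nodes n00001
```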
The Cluster is Up – Now What?
At this point people may say, “you’ve got the cluster up but it’s not running jobs yet.” You are correct. So, let’s rectify that situation. We will need a compiler (C and Fortran), an MPI that is built to use the compilers, and a job scheduler.
Caos NSA installs a compiler suite, gcc-4.1.2, by default, as well as openmpi.
# gcc -v
gcc version 4.1.2
# gfortran -v
gcc version 4.1.2
# rpm -qa | grep -i mpi
Caos NSA installs these packages by default when installing Perceus. Even better, Caos NSA installs environment modules, commonly just called modules in the cluster world, with Perceus. This column is too short to explain what modules can do for you, but if you ever want to use more than one MPI library, more than one compiler or compiler version, or more than one version of an application, then modules is what you need. It solves so many problems. Just google for “environment modules” and the first hit should be the correct website (modules.sourceforge.net).
If we run the command “module avail” we will see all of the modules that are preconfigured with Perceus.
----------------- /usr/Modules/modulefiles -----------------
dot module-info modules null use.own
------------------------- /etc/modulefiles -----------------
I don’t want to discuss the ins and outs of environment modules, but we can easily “load” the openmpi module with the command:
# module load openmpi/1.2.4
# module list
Currently Loaded ModuleFiles:
1) null 2) openmpi/1.2.4
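A quick way to confirm that loading the module really changed your environment is to check that the Open MPI wrapper commands are now on your PATH. Here is a minimal sketch; the function name `missing_tools` is mine, and `mpicc` and `mpirun` are the usual Open MPI command names, which I am assuming this package provides.

```shell
#!/bin/sh
# Hypothetical helper: print the name of every listed command that cannot be
# found on the current PATH (prints nothing if all are found).
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Run this in the same shell session, after `module load openmpi/1.2.4`.
missing=$(missing_tools mpicc mpirun)
if [ -z "$missing" ]; then
    echo "MPI wrappers found"
else
    # $missing is left unquoted on purpose so the names print on one line
    echo "not on PATH:" $missing
fi
```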
So we have an MPI library ready to go. What we need now is a job scheduler. Perceus doesn’t install one by default, but the Caos NSA team has packaged one that you can easily install called Slurm (trust me, just google it). You can install it by starting Sidekick at the command prompt, selecting “Services,” and then scrolling down and selecting “Slurm.”
Slurm asks you if it should be installed as a control service (yes) and then asks for the names of the nodes that should be used. In my case the nodes I wanted to use were home64 and n00001 (I only had two nodes at this point, including the master node, which is home64). Then the installation gives you directions on how to install Slurm in the VNFS capsule, since you will need some parts of Slurm on the compute node. The instructions are easy (write them down).
Slurm Compute Node instructions
# perceus vnfs mount [vnfs name]
# cp -ra /etc/slurm/* /mnt/[vnfs name]/etc/slurm
# perceus vnfs umount [vnfs name]
It took about 1 minute to install Slurm on the master node and about 2 minutes to update the VNFS capsule. Then I just rebooted the compute node and it had Slurm on board.
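If you expect to repeat the capsule update for several capsules, the three commands from the instructions can be wrapped in a small function. This is only a sketch built from the commands as listed above: the names `run` and `add_slurm_to_vnfs`, the `DRY_RUN` switch, and the capsule name `example-capsule` are all mine. By default it only prints what it would do.

```shell
#!/bin/sh
# Hypothetical helper: print the command when DRY_RUN is 1 (the default
# here), otherwise actually run it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical wrapper around the three Slurm compute node commands.
# $1 is the VNFS capsule name.
add_slurm_to_vnfs() {
    vnfs=$1
    run perceus vnfs mount "$vnfs" || return 1
    run cp -ra /etc/slurm/* "/mnt/$vnfs/etc/slurm"
    status=$?
    # Always try to unmount, even if the copy failed
    run perceus vnfs umount "$vnfs" || return 1
    return "$status"
}

# Dry run with a made-up capsule name; set DRY_RUN=0 to really execute
add_slurm_to_vnfs example-capsule
```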
How Long Did it Take?
Since Caos NSA is really an all-in-one cluster kit, I was curious how long it would take me to install it and get a cluster up and running. So I timed myself installing Caos NSA and getting Perceus going. Here are the times:
Table for Installation Times
|Install Caos NSA including Perceus
|Download capsule using wget
|Time to import capsule
|Time to boot node and send capsule
|Time to install Slurm on master node
|Time to rebuild capsule with Slurm
|Time to reboot the compute node
I don’t know about everyone else but 23 mins. to do a complete master node installation with a firewall, cluster management installation and configuration, job scheduler installation and configuration, and getting the first node booted is the fastest I’ve ever seen! I know Joe Landman has talked about installing Rocks in 60 minutes. I will admit that I ran through the installation one time to make sure I had all of my hardware correct, but I didn’t actually configure the cluster during the first run. So 23 mins. is pretty close to the time a first-time installation will take.
Simple as Pie
So I now have a functioning Caos NSA/Perceus cluster in 23 minutes and that includes building the master node, downloading the capsules, and booting the first compute node. That’s pretty darn fast in my opinion. Plus I now have a Perceus configuration where I can boot as many compute nodes as I want as well as a job scheduler that is up and running.
I highly suggest you give Caos NSA and Perceus a try if you have a cluster you are bringing up. It’s rather easy and even, dare I say, fun. Plus the Caos NSA distribution doesn’t get in the way of things and contains the major things I need for building a stable cluster.