Distributed Computing with Kepler on the NCEAS ROCKS Cluster
STATUS:
The content on this page is outdated. The page is archived for reference only. For more information about current work, please contact the Distributed Execution Interest Group.
Overview
If you have access to the NCEAS ROCKS computing cluster (or another cluster), Kepler will run pretty easily on the cluster nodes. Instructions follow. For more information, please see Working Distributed Features.
Distributed computing with Kepler on the NCEAS ROCKS cluster
The host name of the main node of the NCEAS cluster is "catbert.nceas.ucsb.edu"; I will refer to it as "catbert." If you need an account on the NCEAS cluster, talk to Nick Brand.
To get started, check out ptII and Kepler into your home directory. Add your PTII, KEPLER, and JAVA_HOME variables to your .bash_profile file so that all the nodes will have the same environment when you log in. Compile and create the SlaveController stub as described above, but don't start the rmiregistry yet. Note that you will probably not be able to run Kepler in graphical mode; this is discussed more below.
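For example, the .bash_profile additions might look like the following. This is only a sketch: the JDK location is an assumption, and the actual paths depend on where you checked out the source.

# Example environment for Kepler; adjust paths to your checkout and JDK
export PTII=$HOME/project/ptII
export KEPLER=$HOME/project/kepler
export JAVA_HOME=/usr/java/latest    # assumption: your JDK path may differ
export PATH=$JAVA_HOME/bin:$PATH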
Next you need to create a link to your kepler and ptII directories in the /export/apps directory on catbert. The easiest way to do this is to have your kepler and ptII directories in a single common directory inside your home directory. The structure should look like this:
/home/<user>/project/kepler
/home/<user>/project/ptII
Then create a symlink to the project directory inside the /export/apps dir. You may want to create a directory inside /export/apps for yourself first. The command is:
ln -s /home/<user>/project /export/apps/<user>/project
Anything placed in the /export/apps directory will show up in the /share/apps directory of the cluster nodes. This is just an easy way to make the Kepler application available on all nodes of the cluster.
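To confirm that the link is visible cluster-wide, you can list it from one of the compute nodes (using the node c0-0 as an example); you should see your kepler and ptII directories:

catbert$: ssh c0-0 ls /share/apps/<user>/project
kepler  ptII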
Next, you need to get the slave running on all of the nodes you want to execute on. In theory, you should be able to use the cluster-fork command to do this; I have not gotten it to work so far, so you'll need to start the slave manually on each host. It's best to keep a terminal window open for each cluster node you want to run on so you can see the output; if you don't care about the output, you can just run the slave in the background. To do this, ssh to each host you want to run the slave on, then run the runSlave.sh script out of /share/apps/<user>/project/kepler. Note that the runSlave.sh script will start the rmiregistry on each node as well. If the rmiregistry is already running, you'll see a message about port 1099 already being in use; you can usually ignore this error. Here are the commands to run the slave:
catbert$: ssh c0-0
c0-0$: /share/apps/<user>/project/kepler/runSlave.sh &
SlaveController: Waiting for connections...
c0-0$: exit
catbert$: ssh c0-1
c0-1$: /share/apps/<user>/project/kepler/runSlave.sh &
SlaveController: Waiting for connections...
c0-1$: exit
catbert$:
Do this for each of the nodes that you want to run on. It's kind of tedious, and cluster-fork should definitely be able to do this for you, but for some reason it won't run the Java process in the background. Hopefully someone will figure that one out.
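As a stopgap until cluster-fork cooperates, a small shell loop can start the slaves over ssh. This is just a sketch: it assumes passwordless ssh to the nodes and that you're willing to give up watching the output interactively (each node writes it to a log file instead).

# Start the slave on each node, detached, logging to /tmp on that node
for node in c0-0 c0-1; do
    ssh -n $node "nohup /share/apps/<user>/project/kepler/runSlave.sh > /tmp/runSlave.log 2>&1 &"
done

The -n flag keeps ssh from hanging on stdin, and nohup plus the redirect lets the remote shell exit while the slave keeps running.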
Once the slave is running on each node you want to distribute execution to, go back to catbert. Since we can't run Kepler in graphical mode on the cluster, you'll need to use the keplerexecute.sh script to run your workflow. You also need to make sure that whatever workflow you're going to run doesn't try to open any display windows, graphics viewers, or other GUI components; it will fail if it does. There is an example in kepler/lib/workflow/distributed. Look for the workflow called xxx-headless.xml.
You'll also have to update your DistributedKepler.config file with the names of the hosts you want to run on. If the rmiregistry isn't running on catbert, start it now.
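Starting the registry is a one-liner. As for DistributedKepler.config, a plausible sketch is one host name per line, but that format is an assumption; check the existing file in your kepler directory for the real syntax.

catbert$: rmiregistry &
catbert$: cat DistributedKepler.config
c0-0
c0-1

Once the hosts are listed and the registry is running, use this command: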
catbert$: ./keplerexecute.sh <workflow>
You should see the execution distributing over the compute nodes if everything is set up right (and if you left your terminal windows open so you can monitor the nodes).