Installing Distributed Workflows
STATUS:
The content on this page is outdated. The page is archived for reference only. For more information about current work, please contact the Distributed Execution Interest Group.
Overview
Instructions for setting up and running a distributed workflow. See Working Distributed Features for more information.
Java RMI and special compilation instructions
Note: this assumes you can complete a full checkout and compilation of both Kepler and ptII. If you can't, start with the standard Kepler/ptII checkout and build instructions first.
Kepler uses Java Remote Method Invocation (RMI) to pass messages between the master and slave nodes. This requires a special step that is not part of the normal Kepler compilation process. To get this running, follow these steps:
(Note: this has only been tried with Java 1.5.x. Make sure you are using a full Java installation, not a preloaded Windows or Mac version; those won't work. To test it, run 'java -version' on the command line and make sure it prints something like: 'java version "1.5.0_07"...'.)
- Make sure you can compile Kepler with the 'ant compile' command. If anything doesn't compile, fix it. Your JAVA_HOME environment variable should be set to your java-1.5.x directory and your PTII and KEPLER variables should be set to their respective directories.
- Now you need to start the RMI registry on the master. Append an '&' to the command so that rmiregistry runs in the background. Make sure the rmiregistry executable is the one in your normal Java directory (again, not one that came preinstalled). To check this on a *nix machine, you can type 'which rmiregistry'; the result tells you where the rmiregistry command is coming from. If it is not in your Java directory, change your PATH so that it is. A sketch of the full *nix sequence is shown after the Windows note below.
Note: For Windows machines, the absolute path to the kepler classes cannot contain spaces. The command to start the RMI registry is:
start "rmiregistry -J-classpath '-J%KEPLER%/build/classes;%PTII%/build/classes;%KEPLER%/lib/jar/jargon_v2.0.jar'"
Running a slave controller
The slave controller allows the master to communicate over RMI with the distributed node. Depending on your setup, you'll probably want to have a full Kepler/ptII CVS checkout on each slave host. Here are the steps to get the slave running:
- Log in to the host and check out (or otherwise obtain a copy of) Kepler and ptII. Follow the same steps above for compiling Kepler and running rmiregistry. All of those steps must be complete before you try the next step.
- Edit the build.properties file. You can set options for registering your host on the EarthGrid (or not). Follow the instructions in the build.properties file.
- If you would like to allow only specific users access to your slave, edit the file configs/ptolemy/configs/kepler/DistributedKeplerSlaveACL.config. You can select users to allow or deny access to. Usernames should be in the form of a registered LDAP distinguished name (i.e. uid=berkley,o=NCEAS,dc=ecoinformatics,dc=org). You can also allow 'all' or deny 'all' and select the order in which the access control instructions are processed. (An illustrative sketch of such an access list appears after this list.)
- Run the slave controller. The command is:
ant runSlaveController
The slave should print a status message stating that it is now ready to accept RMI requests from Kepler. When you want your slave to exit, type 'X' and allow the slave to shut down. Failing to do this will create a "ghost" entry on the EarthGrid (if you have registered your slave on the grid).
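The exact syntax of DistributedKeplerSlaveACL.config is described by the comments in the file itself; the sketch below is illustrative only, and the keywords shown ('order', 'allow', 'deny') are hypothetical placeholders for whatever the shipped file actually uses. It is only meant to show the kind of policy you can express: allow or deny individual LDAP distinguished names or 'all', processed in a chosen order.
# illustrative only -- see the comments in the shipped config file for the real syntax
order allow,deny
allow uid=berkley,o=NCEAS,dc=ecoinformatics,dc=org
deny all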
Running the distributed workflow
Now that your slaves are running, you need to tell Kepler which slaves to execute on. Go to the "Tools" menu and choose "Distributed Computing Options". Enter your authentication information, then choose the slave controllers you want to use. The authentication information is used for logging into each slave; you must use a valid LDAP/EarthGrid username, domain, and password. The "available" list is populated from slaves registered on the EarthGrid. If you don't see the slave you want to use, you can add it to the "used" list manually.
Once you have chosen your slaves you can run your workflow. Open the workflow you want to run and click the run button. If you're watching the terminal output from your slaves, you should see a bunch of messages telling you what the slave is doing. If you have any errors, Kepler should tell you. You can use the terminal output of both the slave and the master to troubleshoot.
For your first workflow, try using one of the examples in workflows/distributed. There are several in there that are known to work and should get you started.
If you don't see any output from the slave, check that your firewall allows connections on port 1099 (the default Java RMI registry port).
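A quick way to test this from the master is to probe the slave's registry port directly, for example with netcat or telnet ('slavehost' here stands for your slave's hostname):
nc -vz slavehost 1099
telnet slavehost 1099
If the connection is refused or times out, adjust the firewall on the slave (and any network firewalls in between) before running the workflow again.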
If you have a specialized DistributedCompositeActor that you want to run only on a specific slave (or slaves) instead of on all of the "used" slaves from the configuration dialog, you can double-click that DCA and choose only a subset of the "used" slaves. This lets jobs with specific computing needs run on only certain slaves. It also lets you move the execution to a slave where a large dataset already exists, without having the rest of the workflow execute there.