
Distributed Computing in Kepler

STATUS:

The content on this page is outdated. The page is archived for reference only. For more information about current work, please contact the Distributed Execution Interest Group.

Overview

We envision an easy-to-use mechanism for Kepler users to harness the power of multiple computing nodes to accomplish computations that would be difficult, time-consuming, or impossible to perform on a single node. Ideally, this system would be tightly integrated into the Kepler experience so that scientists need not focus on the mechanistic issues surrounding grid computing and can instead focus on the scientific models they would like to construct and execute.

Such a system would need several important features: single sign-on, fine-grained access control, automated and efficient staging of data and models to nodes as needed, efficient scheduling of resources, monitoring of execution progress, and job control, among others.

Related documents

Design

(Notes from the Dec 2006 attachments below)

A -> B -> C

source -> distrib -> sink

Assumptions: On the initiating host

  1. A/B/C independent
  2. B - No dependency between iterations
  3. B - Output order is independent of input order (this may change; preserving order has a cost, so don't enforce it unless necessary)
  4. B - exec time >> startup/transport time
  5. Targeting PN, SDF, DDF, but B can contain other domains

Assumptions: On the remote host

  1. Kepler engine or agent is already running
  2. The workflow (MoML) to be run is not already on the remote host
    • Is the Java code already in place?
  3. All hosts use the same JVM version and Kepler version (see the version-check sketch below)
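
To make assumption 3 concrete, the remote agent could report its JVM and Kepler versions during the handshake so the master can refuse mismatched hosts. The sketch below is only illustrative; the VersionInfo class and the KEPLER_VERSION constant are assumptions, not existing Kepler code.

    import java.io.Serializable;
    import java.util.Objects;

    // Hypothetical handshake record exchanged between the master and a remote agent.
    public class VersionInfo implements Serializable {
        // Placeholder; a real build would read this from release metadata.
        public static final String KEPLER_VERSION = "1.0.0";

        public final String jvmVersion = System.getProperty("java.version");
        public final String keplerVersion = KEPLER_VERSION;

        // The master refuses a remote host whose versions do not match its own.
        public boolean isCompatibleWith(VersionInfo other) {
            return Objects.equals(jvmVersion, other.jvmVersion)
                && Objects.equals(keplerVersion, other.keplerVersion);
        }
    }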

Process:

  1. Start up agents - external to Kepler
  2. Node discovery - config file or p2p
  3. Select components to distribute from iterate actors in workflow
  4. Choose/calculate nodes to use
    • set max number of nodes
    • may not know # of iterations
  5. Propagate B to the nodes
  6. Fire cycle for each token T on B (a sketch follows after this list)
    1. Choose node N (How does local controller decide?)
    2. Transfer T to N
    3. Execute B on N
    4. Transfer the result of B, T-out, to the master node
    5. Write T-out to output of B on master node
  7. Clean up / shut down remote workflow nodes (they may stay running for a while; provide a mechanism to shut them down later)
  • Error handling: no response from a node, node can't compute, node returns an invalid token
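
A minimal sketch of the per-token fire cycle in step 6, with the simple error handling from the last bullet (on failure, report the node and retry elsewhere). RemoteNode, NodeSelector, and FireCycle are hypothetical placeholders, not existing Kepler classes.

    // Hypothetical fire cycle on the master: one token in, one result token out.
    public class FireCycle {

        interface RemoteNode {
            // Transfer T to the node, execute B there, and return T-out.
            // Throws on no response, failure to compute, or an invalid result token.
            Object execute(Object token) throws Exception;
        }

        interface NodeSelector {
            RemoteNode choose();                  // how the local controller decides on node N
            void reportFailure(RemoteNode node);  // feeds back into the next choice
        }

        private final NodeSelector selector;
        private final int maxAttempts;

        FireCycle(NodeSelector selector, int maxAttempts) {
            this.selector = selector;
            this.maxAttempts = maxAttempts;
        }

        // Steps 6.1-6.5: choose N, send T, run B on N, receive T-out;
        // the caller then writes T-out to the output of B on the master node.
        Object fire(Object token) throws Exception {
            Exception last = new Exception("no execution attempts were made");
            for (int attempt = 0; attempt < maxAttempts; attempt++) {
                RemoteNode node = selector.choose();
                try {
                    return node.execute(token);
                } catch (Exception e) {
                    selector.reportFailure(node);  // no response / can't compute / invalid token
                    last = e;
                }
            }
            throw last;
        }
    }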

SlaveController runs outside Kepler, on the remote host. It can do the following (a possible interface is sketched after this list):

  1. transfer to B
  2. start workflow
  3. stop workflow
  4. report errors
  5. clean up workflow
  6. report load metadata
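
One possible shape for the SlaveController, mirroring the six responsibilities above; the interface, its method signatures, and the LoadMetadata type are illustrative assumptions (item 1 is read here as accepting the transferred B).

    // Hypothetical remote-side controller; the methods mirror the list above.
    public interface SlaveController {
        void acceptWorkflow(byte[] momlBytes); // 1. transfer of B to this host
        void startWorkflow();                  // 2. start workflow
        void stopWorkflow();                   // 3. stop workflow
        void reportError(Exception error);     // 4. report errors
        void cleanUpWorkflow();                // 5. clean up workflow
        LoadMetadata reportLoadMetadata();     // 6. report load metadata

        // Simple placeholder for host load information.
        class LoadMetadata {
            public double cpuLoad;
            public long freeMemoryBytes;
        }
    }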

MasterController runs inside Kepler, on the local host. It does the reverse of the above, plus (a possible interface is sketched after this list):

  1. Transfer B
  2. Set up stubs
  3. Send tokens
  4. Receive errors
  5. Can monitor and resend a token if waiting too long
  6. start and stop remote workflow
  7. heartbeat
  8. status
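
A corresponding sketch of the MasterController, again with invented names and signatures that simply mirror the numbered items above.

    // Hypothetical local-side controller; the methods mirror the list above.
    public interface MasterController {
        void transferWorkflow(String nodeId, byte[] momlBytes);  // 1. Transfer B
        void setUpStubs(String nodeId);                          // 2. Set up stubs
        void sendToken(String nodeId, Object token);             // 3. Send tokens
        void onError(String nodeId, Exception error);            // 4. Receive errors
        void resendIfTimedOut(Object token, long timeoutMillis); // 5. Monitor and resend a token
        void startRemoteWorkflow(String nodeId);                 // 6. Start the remote workflow...
        void stopRemoteWorkflow(String nodeId);                  //    ...and stop it
        boolean heartbeat(String nodeId);                        // 7. Heartbeat
        String status(String nodeId);                            // 8. Status
    }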

Where additional ports are disallowed, only one fixed port is used for the slave listener; the SlaveController then tracks which dynamic port is being used by whom, and tracks message IDs.
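
One way to read this constraint is that all traffic is multiplexed through the single fixed port and demultiplexed by message ID. The envelope below is a sketch of that idea; the class, the field names, and the port number are all invented for illustration.

    import java.io.Serializable;
    import java.util.UUID;

    // Hypothetical message envelope: everything travels through one fixed port,
    // and the controllers match requests to responses by message ID.
    public class Envelope implements Serializable {
        public static final int SLAVE_LISTENER_PORT = 9443; // the single fixed port (placeholder)

        public final String messageId = UUID.randomUUID().toString();
        public final String senderId;  // which controller sent this
        public final String type;      // e.g. "TOKEN", "RESULT", "ERROR", "HEARTBEAT"
        public final byte[] payload;   // serialized token or status

        public Envelope(String senderId, String type, byte[] payload) {
            this.senderId = senderId;
            this.type = type;
            this.payload = payload;
        }
    }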

Many Grid-based and peer-to-peer systems exist that could serve as the basis for this design. Here are some possibilities:

  • Using JXTA for a peer-to-peer Kepler shared computing network
  • Batch submission systems
    • OGSA managed job service
    • NIMROD/APST
    • Condor

Design Notes

Simple Flow

  • Start the distributed WF on the server
  • For each client, query whether the WF components are available
  • If not all components are available, do not use that client
  • If they are available, send the client the WF
  • Client runs the WF and puts it in the "receiving" state
  • Server starts the data flow and sends a data token to the client
  • Client executes the WF and returns the result
  • Server accepts the result (a sketch of this flow follows after this list)
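
The server side of this flow might look roughly like the sketch below; the Client interface, its methods, and the round-robin dispatch are assumptions for illustration only.

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical server-side driver for the simple flow above.
    public class SimpleFlowServer {

        interface Client {
            boolean hasComponents(List<String> componentIds); // query if the WF components are available
            void receiveWorkflow(byte[] moml);                // send the client the WF
            Object run(Object dataToken);                     // client executes the WF and returns the result
        }

        // Sends the WF to usable clients, dispatches tokens, and collects the results.
        static List<Object> run(List<Client> clients, List<String> componentIds,
                                byte[] moml, List<Object> tokens) {
            List<Client> usable = new ArrayList<>();
            for (Client c : clients) {
                if (c.hasComponents(componentIds)) { // skip clients missing components
                    c.receiveWorkflow(moml);         // client now waits in the "receiving" state
                    usable.add(c);
                }
            }
            if (usable.isEmpty()) {
                throw new IllegalStateException("no client has all required components");
            }
            List<Object> results = new ArrayList<>();
            for (int i = 0; i < tokens.size(); i++) {
                Client c = usable.get(i % usable.size()); // naive round-robin dispatch
                results.add(c.run(tokens.get(i)));        // server accepts the result
            }
            return results;
        }
    }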

Data transfer

  • Remote clients accept one token at a time.
  • It is up to the WF designer to select the appropriate type of token to send to the clients
  • If the token is huge, a URLToken should be used; if it is small, a DataToken might be used (see the sketch after this list)
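
A sketch of the size-based choice described above. The cutoff, the URLToken and DataToken types, and the helper itself are placeholders that only follow the names used in these notes; they are not confirmed Kepler classes.

    // Hypothetical helper that picks a token type based on payload size,
    // following the rule of thumb in the notes above.
    public class TokenChooser {
        // Placeholder cutoff; a real deployment would tune this.
        static final long MAX_INLINE_BYTES = 1_000_000;

        interface WireToken {}
        static class DataToken implements WireToken { byte[] data; } // small payloads travel inline
        static class URLToken implements WireToken { String url; }   // huge payloads travel by reference

        static WireToken choose(byte[] payload, String stagedUrl) {
            if (payload.length <= MAX_INLINE_BYTES) {
                DataToken t = new DataToken();
                t.data = payload;
                return t;
            }
            URLToken t = new URLToken();
            t.url = stagedUrl; // the data has already been staged somewhere the client can reach
            return t;
        }
    }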