
Kepler Provenance Framework

STATUS

The content on this page is outdated. The page is archived for reference only. For more information about current work, please contact the Provenance Group.

 

Overview

Provenance is the recording of origin and derivation information about scientific workflow runs. It is essential for recreating results at a later time and for proving the validity of the process that was followed when the published results were generated. It also gives users and publication reviewers/readers an idea of how the run happened and which parameters and inputs were used.

 This document is intended for Kepler developers. It is a DRAFT DESIGN DOCUMENT and does not reflect functionality as it currently exists in Kepler. Comments and feedback are appreciated. 


Types of provenance

We categorize provenance information that needs to be saved as:

  • Data provenance
    • Intermediate results
    • End results
  • Process (=workflow instance) provenance
    • Record the data and parameters used in the workflow run
  • Error and execution logs

Issues related to saving provenance information include: deciding the level of detail to save, the format of the information and where to save it, handling dynamic data and parameter changes during a run and over time, saving process products (workflow instances), and recording how and by whom the run was made.

Provenance Recorder Utility Design

  • The Kepler provenance recording utility is designed to be parametric and customizable to allow for different report formats, different levels of detail and cache destinations.
  • It will also save information on the user who ran the workflow and when the run happened.

Parameters of the Provenance Recording Utility:

RAN BY (Name):

Workflow user's name.

INSTITUTION NAME:

User's institution.

RUN (Experiment) NAME:

Name for the workflow instance that is being run.

DATE:

This is a constant that is set automatically.

LEVEL OF DETAIL:

This is a combo (selection) box that specifies how much information one wants to record.
  • Types of information to be saved: The recorder will be able to generate three files, depending on the level of detail.
    • Data file: The parameters (only if they are explicitly updated) and the outputs for each firing of each selected actor (or all actors) in the workflow, annotated with the actor name. (Actor names are unique.)
    • Workflow instance file: This is the MoML for the run. It includes all the parameters for that specific run.
    • Execution&Error Log: This is the information on errors and the manager state/execution status.
The names for these files will be formatted as wfName_date_time_type.ext:
    • wfName_mmddyyyy_hh-mm-ss_data.ext
    • wfName_mmddyyyy_hh-mm-ss_wf.ext
    • wfName_mmddyyyy_hh-mm-ss_error.ext
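
As a concrete illustration, here is a minimal Java sketch of this naming convention (the class and method names are illustrative, not part of the design):

import java.text.SimpleDateFormat;
import java.util.Date;

// Minimal sketch of the wfName_mmddyyyy_hh-mm-ss_type.ext convention above.
// The class and method names are illustrative only.
public class LogFileNamer {
    public static String logFileName(String wfName, String type, String ext) {
        // mmddyyyy_hh-mm-ss, e.g. 07152004_14-03-22
        String stamp = new SimpleDateFormat("MMddyyyy_HH-mm-ss").format(new Date());
        return wfName + "_" + stamp + "_" + type + "." + ext;
    }
}

For example, logFileName("myWorkflow", "data", "xml") yields something like myWorkflow_07152004_14-03-22_data.xml.
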
Below are the different levels and an explanation of what information is saved at each level. All levels except the error log should save the results going into the sink actors (actors without output ports), as these are the end results of the workflow.
  1. Verbose-all: This is the maximum amount of information that can be saved.
    All three file types above are saved.
  2. Verbose-some: When the user selects verbose-some from the drop-down selection box, a selection dialog pops up when the workflow is run. The dialog lists all the actors with checkboxes, and the user selects the actors for which data information should be saved. This feature requires the recorder to maintain the list of actors, so that no modifications to the base actor class in Ptolemy II are required.
    All three file types above are saved; the data file is created only for the selected actors.
  3. Medium: Includes workflow and error logs.
    Only workflow and error log files above are saved.
  4. Error Log: Execution and error logs.
    Only the execution log file is saved.

 

OUTPUT FORMAT:

This is a drop-down selection (combo) box that specifies the format of the output. Currently, we would like to have plain text, text/HTML, and XML.

The XML output later can be used to generate reports (e.g. HTML pages) by using an XSLT or docbook-like transformer.
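
As an illustration, the standard JAXP transformation API could drive such a report. A minimal sketch, assuming a hypothetical stylesheet dataLogToHtml.xsl and the file names from the convention above:

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Minimal sketch: turn an XML data log into an HTML report with XSLT.
// The stylesheet and file names below are hypothetical.
public class ReportGenerator {
    public static void main(String[] args) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource("dataLogToHtml.xsl"));
        transformer.transform(
                new StreamSource("myWorkflow_07152004_14-03-22_data.xml"),
                new StreamResult("myWorkflow_07152004_14-03-22_report.html"));
    }
}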

LOG DIRECTORY (Only for localhost and no cache.):

This is where the user's log files will be written. It applies only to the localhost and no-cache modes. The default is .kepler under $HOME.

CACHE DESTINATION:

Currently, the cache destination parameter can be one of the modes below. The default value is NO CACHE.
  • NO CACHE: The provenance information will be saved to files under Log Directory.
  • localhost: The provenance information will be saved to your log directory, but the caching is also enabled for smart re-runs. (See below.)
  • To database: Here we will define a database schema and specify how the saved data will be queried and used by the workflow system. This is not completely designed yet. This mode will also require an additional set of parameters for database connection information.
  • To SRB: Not designed yet; same situation as the database mode.

 

The File Format for the Workflow Log File

The workflow log file is the MoML file for the run with the right set of parameters. It should be annotated at the top with information on the user and the run. It is just an XML file; there is no HTML version of it.

The File Format for the Data Log File

XML

<element User_Name="name"/>
<element User_Institution="place"/>
<element Run_Name="name"/>
<element Time_Date="value"/>
<RUN number="value">
    <Actor name="name">
        <Params>
            <Parameter name="name">value</Parameter>
            <!-- ... one Parameter element per explicitly updated parameter ... -->
        </Params>
        <Iteration number="value">
            <Output_Port name="name">
                <Value_Sent type="token_type">value</Value_Sent>
                <!-- ... one Value_Sent element per token sent ... -->
            </Output_Port>
            <!-- ... one Output_Port element per output port ... -->
        </Iteration>
        <Iteration number="value+1">
            <!-- ... same structure for each subsequent iteration ... -->
        </Iteration>
    </Actor>
    <!-- ... one Actor element per recorded actor ... -->
</RUN>
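
To connect this format to the listener API below, here is a hypothetical fragment showing how a tokenSent event could be serialized into a Value_Sent element (the class name is illustrative; XML escaping and file I/O are omitted):

import ptolemy.actor.TypedIOPort;
import ptolemy.data.Token;

// Hypothetical sketch: serialize a sent token into the Value_Sent element
// of the data log format above. XML escaping and file I/O are omitted.
class DataLogWriter {
    private final StringBuilder dataLog = new StringBuilder();

    void recordTokenSent(Token token, TypedIOPort fromPort) {
        dataLog.append("<Output_Port name=\"").append(fromPort.getName()).append("\">\n")
               .append("    <Value_Sent type=\"").append(token.getType().toString())
               .append("\">").append(token.toString()).append("</Value_Sent>\n")
               .append("</Output_Port>\n");
    }
}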

HTML

The HTML version of the data log will be formatted either on the fly (if the user only wants HTML logs) or later, using the information in the XML file. The default format for the data log file is XML.

The API for the Provenance Execution Listener

/** The ProvenanceExecutionListener (PEL) is an event-based
 *  approach to collecting provenance information.  When an event
 *  occurs, the manager notifies the PEL just as it would notify any
 *  other execution listener.  The difference is that the PEL also
 *  collects change events as well as some other events that we have
 *  created on our own.
 */
public class ProvenanceExecutionListener extends Attribute 
                     implements ExecutionListener, ChangeListener {

    /** Three constructors that are needed because we extend Attribute.
     */
    public ProvenanceExecutionListener(); 
    public ProvenanceExecutionListener(Workspace workspace); 
    public ProvenanceExecutionListener(CompositeEntity container, String name);
    
    ///////////////////////////////////////////////////////////////////
    ////                         parameters                        ////
    
    /** A parameter to hold the name of the workflow user 
     */
    public Parameter ranBy;
    
    /** A parameter to hold the name of the institution the
     *  user works for.
     */
    public Parameter institutionName;
    
    /** A parameter to label the name of this run or experiment
     */
    public Parameter runName;
    
    /** A parameter to keep track of the date
     */
    public Parameter date;
    
    /** A parameter to set the level of detail we want to save
     */
    public Parameter levelOfDetail;
     
     /** A parameter that allows the user to specify the desired 
     *   output format.
     */
    public Parameter outputFormat;
    
    /** A parameter to set the destination of the saved information
     */
    public Parameter cacheDestination;

    ///////////////////////////////////////////////////////////////////
    ////                       public methods                    ////
        
    /** Report that the actor specified is about to fire (call the prefire() 
     *   method). IterationCount is a parameter given to SDF actors that are 
     *   about to fire so we can record this as well.  actorAboutToFire is useful  
     *   for figuring out how many times an actor tries to fire or for debugging.
     */
    public void actorAboutToFire(Actor actor, int iterationCount);
    
    /** Report that a change request has been successfully executed.
     *  This method is part of implementation of ChangeListener.
     */
    public void changeExecuted(ChangeRequest change);

    /** Report that a change request has resulted in an exception.
     *  This method is part of implementation of ChangeListener.
     */
    public void changeFailed(ChangeRequest change, Exception exception) ;    
    
    /** Report an execution failure.  We get this by default by implementing
     *   the ExecutionListener.
     */
    public void executionError(Manager manager, Throwable throwable);

    /** Report that the current execution is finished.  We get this by 
     *   default by implementing the ExecutionListener.
     */
    public void executionFinished(Manager manager);

    /** Report that the manager has changed state.  We get this by 
     *   default by implementing the ExecutionListener.
     */
    public void managerStateChanged(Manager manager);
     
    /** Report that the token given to the listener has been sent on the
     *  specified port and channel.  This is important if the user wants to see
     *  what tokens were sent right before an error occurred.  Also, we need
     *  these tokens if a person wants to do "smart re-runs".
     */
    public void tokenSent(int channelIndex, Token token, TypedIOPort fromPort);
}
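
A minimal sketch of how a workflow run might wire the listener up, using the standard Ptolemy II Manager and NamedObj listener-registration calls (the model setup here is illustrative):

import ptolemy.actor.CompositeActor;
import ptolemy.actor.Manager;
import ptolemy.kernel.util.Workspace;

// Illustrative wiring sketch: attach the PEL to a model and run it.
public class ProvenanceExample {
    public static void main(String[] args) throws Exception {
        Workspace workspace = new Workspace("w");
        CompositeActor topLevel = new CompositeActor(workspace);
        topLevel.setName("myWorkflow");

        Manager manager = new Manager(workspace, "manager");
        topLevel.setManager(manager);

        // The PEL lives in the model as an Attribute so it can be saved
        // with the MoML, and listens for both execution and change events.
        ProvenanceExecutionListener pel =
                new ProvenanceExecutionListener(topLevel, "provenanceRecorder");

        manager.addExecutionListener(pel); // executionError/Finished, managerStateChanged
        topLevel.addChangeListener(pel);   // changeExecuted, changeFailed

        manager.execute();                 // run the workflow; the PEL records as it goes
    }
}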

Smart Re-runs

The goal of "smart re-runs" is to efficiently rerun workflows that have been run before. When smart re-run is selected, the user might have the ability to utilize the cache, or part of it, for the run. This would keep Kepler from repeating some work that occurred during a previous run. We may let the user choose at runtime which actors are able to take advantage of smart re-runs. An example of a case where an actor could not take advantage of the cache is when the actor depends on temporary files that might be deleted after the run.
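
A hypothetical sketch of the cache lookup at the core of this idea (all names are illustrative; the real design is not settled):

import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a smart re-run cache: before firing an actor,
// look up the output recorded for the same inputs during a previous run.
public class SmartRerunCache {
    // Key: actorName + ":" + iteration + ":" + inputSignature.
    private final Map<String, String> recordedOutputs = new HashMap<String, String>();

    public void record(String actor, int iteration, String inputs, String output) {
        recordedOutputs.put(actor + ":" + iteration + ":" + inputs, output);
    }

    /** Returns the cached output, or null if the actor must actually fire,
     *  for example because it opted out of smart re-runs (it may depend on
     *  temporary files that were deleted after the previous run). */
    public String lookup(String actor, int iteration, String inputs, boolean cacheable) {
        if (!cacheable) {
            return null;
        }
        return recordedOutputs.get(actor + ":" + iteration + ":" + inputs);
    }
}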

 

Contributors: Ilkay Altintas, Oscar Barney

Open Questions:

  • How would this be used in the KSW framework? Can it be useful if it were packaged in archives for later use?
  • Which data/metadata formats should we add so that users can search for the provenance information later?
  • How can we generate meaningful reports from the provenance information?
  • Can we save the files into SRB and/or an RDBMS? Can we use the SRB space that was created for Kepler?
  • What will be the DB schema to allow for meaningful searches?
  • Can we insert all the information into a relational DB and have the files generated from the DB later if required? If so, what would be the relational schema for the provenance info?
