Kepler Provenance Use Cases
STATUS
The content on this page is outdated. The page is archived for reference only. For more information about current work, please contact the Provenance Group.
Overview
This document is intended for Kepler developers
interested in discussing potential uses of provenance in Kepler workflows. It
complements the Kepler Provenance Framework draft
design document. Please feel free to add use cases, examples, and other
ideas to this document.
Discussion topics
Provenance in scientific workflows is a broad topic. This section breaks the topic down into a number of subtopics that may overlap significantly but will hopefully provide useful perspectives. Subtopics will likely range from the abstract to the technical.
General uses of provenance in scientific workflows.
- Create and maintain associations between workflow inputs, workflow
outputs, workflow definitions, and provenance information.
- One of the chief advantages of scientific workflows (over scripting, for example) is the potential for managing all of the information related to a workflow run automatically.
- Ideally, workflow users would not need to keep track of everything they do using ad hoc directory structures and file names.
- Debug a workflow design.
- One could use the provenance framework to reconstruct the sequence of events that occurred during a workflow run.
- This could be an effective alternative to stepping through workflow execution using a debugger.
- Replicate the results reported by another researcher.
- It should be simple to repeat a result by re-running a scientific workflow using the same workflow definition, inputs, parameters, etc.
- It should also be possible to use provenance information to repeat key steps in the workflow manually (i.e., outside the workflow framework) by examining the process by which results were generated by the automated workflow.
- Determine what data input to the workflow contributed in any way to a
particular output data product (see the lineage sketch after this list).
- Simply listing all inputs to the workflow is insufficient since some inputs may have had no effect whatsoever on a particular output data product.
- Determine what workflow components (actors) were involved in generating a
particular output data product.
- Simply listing all actors in a workflow definition is insufficient since workflows may include parallel paths and conditional control-flow constructs.
- Perform 'smart re-run' of a workflow.
- Provenance information could be used to efficiently recalculate the results of a workflow when new values are supplied for a subset of its inputs and parameters.
- Key intermediate results stored as provenance could be used to avoid performing some of the computations again.
- Evaluate the results of a workflow without rerunning it.
- A worker may not be able to judge from the inputs and outputs of a workflow alone whether the results should be trusted. Inspection of key intermediate results may be required.
- Archive scientific results in a repository and retrieve them later.
- Automatically collecting provenance information could make it much easier for scientists to archive their results in public databases.
- Storing provenance in public databases along with results would enable searches for results derived from particular data sets, public databases, or other sources.
- It would also enable searches for results created by particular workflows, actors, or underlying scientific software.
- Checkpoint and restart workflows.
- Support for storing intermediate states and data products might also provide a way to checkpoint workflows periodically.
- This might allow one to restart workflows that stopped prematurely due to a computer system crash, network outage, etc.
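As one illustration of the lineage queries above, the sketch below shows how recorded actor firings might be walked backwards from an output data product to find the actors (and, implicitly, the inputs) that contributed to it. This is a minimal sketch only: the FiringRecord structure, class names, and token identifiers are hypothetical and do not correspond to any actual Kepler or Ptolemy II API.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: illustrates backward lineage traversal over recorded
// firings; none of these names exist in Kepler.
public class LineageQuery {

    // One recorded event: an actor firing that consumed and produced tokens.
    static final class FiringRecord {
        final String actorName;
        final List<String> inputTokenIds;
        final List<String> outputTokenIds;

        FiringRecord(String actorName, List<String> inputs, List<String> outputs) {
            this.actorName = actorName;
            this.inputTokenIds = inputs;
            this.outputTokenIds = outputs;
        }
    }

    // Index from each token to the firing that produced it.
    private final Map<String, FiringRecord> producerOf = new HashMap<>();

    void record(FiringRecord firing) {
        for (String tokenId : firing.outputTokenIds) {
            producerOf.put(tokenId, firing);
        }
    }

    // Walk backwards from one output token, collecting every actor on its
    // derivation path. Tokens with no recorded producer are workflow inputs.
    Set<String> contributingActors(String outputTokenId) {
        Set<String> actors = new HashSet<>();
        Set<String> visited = new HashSet<>();
        Deque<String> pending = new ArrayDeque<>();
        pending.push(outputTokenId);
        while (!pending.isEmpty()) {
            String tokenId = pending.pop();
            if (!visited.add(tokenId)) {
                continue;
            }
            FiringRecord firing = producerOf.get(tokenId);
            if (firing == null) {
                continue; // no producer recorded: a workflow input
            }
            actors.add(firing.actorName);
            firing.inputTokenIds.forEach(pending::push);
        }
        return actors;
    }

    public static void main(String[] args) {
        LineageQuery query = new LineageQuery();
        query.record(new FiringRecord("AlignSequences", List.of("seqs"), List.of("alignment")));
        query.record(new FiringRecord("InferTree", List.of("alignment"), List.of("tree")));
        query.record(new FiringRecord("UnrelatedActor", List.of("other"), List.of("plot")));
        // Reports only the two actors on the derivation path of "tree",
        // not every actor in the workflow definition.
        System.out.println(query.contributingActors("tree"));
    }
}

The same backwards walk, collecting token identifiers instead of actor names, would answer which workflow inputs contributed to a particular data product.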
Information that might be recorded by a provenance system.
This section lists different categories of information that might be recorded by, sent to, stored in, accessed through, or otherwise processed by provenance management frameworks, systems, and tools. It should not be assumed that any of this information must be literally "stored in" the provenance system.
- Workflow results
- Obviously, provenance records are not of much use without the results they annotate.
- Somehow we need to attach the results to the provenance information (or vice versa).
- Storing the names or paths of files generated in a workflow is not sufficient, since files can later be renamed, moved, or deleted.
- Workflow definitions
- The particular definition of the workflow needs to be stored along with results.
- Since a workflow definition can be updated between runs, simply referring to the name of a MoML file is not enough.
- The definition at runtime needs to be stored immutably with the rest of the provenance information.
- Static parameter values
- The values assigned to workflow and actor parameters before the workflow is run are stored in the MoML file. Solving the problem of attaching workflow definitions to results and provenance therefore solves this problem as well.
- Dynamic actor parameter values
- Parameter values may change during the course of a workflow run.
- For example, in collections-based workflows input collections may override particular actor parameters by including special metadata tokens.
- Similarly, upstream actors in collections-based workflows may override parameters of downstream actors by inserting special metadata tokens into the data stream.
- In general, workflows may iterate over parameter values or search a parameter space, running a particular actor or subworkflow multiple times using different parameter values.
- The parameter values assigned to an actor when it fires need to be associated with the outputs affected by those parameters.
- Workflow inputs
- References to data files read by the workflow might be stored as static parameter values. But perhaps the files themselves (or their contents) should be stored explicitly as provenance as well? How else can we ensure that workflow results can be replicated in the future, especially by different workers? References to permanently archived data sets might be safe enough, though.
- Results of queries of mutable data sources are not reproducible in general, and might need to be stored as provenance. For example, a BLAST search of GenBank might return different results on different days.
- Data read directly from sensors or instruments in a workflow might need to be stored. For workflows automating instrument control or directly processing sensor data, this data might not be stored anywhere else.
- Interactive inputs from users of workflows must be recorded.
- Intermediate data products
- Intermediate results may be expensive to recalculate if they are needed in future.
- Intermediates may be useful as inputs to other workflows. It should be easy to extract intermediate data products from the provenance system and use them as inputs to different workflows.
- Verification of workflow outputs may require manual or automated examination of certain (but not all) intermediate data products at a later time. For example, a researcher evaluating the quality of a phylogenetic tree may wish to see the protein sequence alignment generated in a workflow and used later in the same workflow as a character matrix for inferring the tree.
- Intermediate results would be needed to optimize the performance of 'smart re-runs.'
- Debugging of workflow definitions might require storing many (or even all) intermediate data products.
- Unfortunately, storing all intermediate data products might not always be practical due to the large number and size of tokens some workflows generate. Besides debugging, what other use cases might require storing every token sent between actors?
- Temporary files created by actors
- Actors that wrap external programs have access to the standard input, standard output, and standard error streams associated with each program run.
- An external program might write temporary output (e.g., log) files to disk as well, and the actor wrapping the program may or may not parse these output files.
- The contents of these I/O streams and files might be useful to keep as provenance under some circumstances. For example, validation of results may require manual examination of files that are generated by wrapped applications.
- However, leaving these files on disk where they were created can lead to a real mess!
- Information about actors
- Provenance should allow one to answer questions like, "What actors were used in this workflow run?" and "What actors were used to compute this particular data product?"
- MoML (and KSW) files do not answer these questions since they list every actor
in the workflow definition, not just those actors that actually fired during a
workflow run or contributed to a particular workflow output.
- A particular execution of a workflow containing conditional control-flow constructs may not use some of the actors described by the static workflow definition at all.
- Workflows may try a number of different algorithms (provided by distinct actors) and keep all or only a subset of the aggregate results.
- Answers to these questions are needed in order to properly cite software and literature in scientific publications reporting results computed by a workflow.
- We need to record versions of external software packages employed by actors for similar reasons.
- Human readable annotations
- Ideally, the outputs of workflows would be self-explanatory to humans examining them.
- Some scientific applications require that the program be cited if its outputs are used in a publication.
- We could provide actor authors with an API for annotating outputs with comments, copyright notices, required literature citations, etc. (see the sketch after this list).
- References to discarded data
- Data or data sets not meeting some criterion may be discarded in a workflow while the rest of the data is processed normally.
- Similarly, data sets that trigger (caught) exceptions may be discarded.
- The absence of data or results can be as significant as their presence.
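As a concrete but entirely hypothetical illustration of the annotation API floated above, the sketch below shows one shape such an API might take: an actor attaches comments, copyright notices, and required citations to an output while producing it, and the provenance system stores them alongside the data product. None of these class or method names exist in Kepler; they only suggest what an actor author might be given.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: one possible annotation API for actor authors;
// none of these names exist in Kepler.
public final class OutputAnnotations {

    public enum Kind { COMMENT, COPYRIGHT, CITATION }

    public static final class Annotation {
        public final Kind kind;
        public final String text;

        Annotation(Kind kind, String text) {
            this.kind = kind;
            this.text = text;
        }

        @Override
        public String toString() {
            return kind + ": " + text;
        }
    }

    private final List<Annotation> annotations = new ArrayList<>();

    // Called by an actor while it produces a data product.
    public void add(Kind kind, String text) {
        annotations.add(new Annotation(kind, text));
    }

    // Read back by whatever component renders or archives the data product.
    public List<Annotation> all() {
        return Collections.unmodifiableList(annotations);
    }

    public static void main(String[] args) {
        OutputAnnotations notes = new OutputAnnotations();
        // An actor wrapping an external program records the citation the
        // program requires (placeholder text, not a real citation).
        notes.add(Kind.CITATION, "Author, A. ExampleAligner, version 1.0.");
        notes.add(Kind.COMMENT, "Alignment computed with default parameters.");
        notes.all().forEach(System.out::println);
    }
}

Keeping such annotations in the provenance record, rather than embedded in the data itself, would let the same data product be rendered or archived with or without the accompanying notes.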
Contributors
- Tim McPhillips