
proposed additions to provenance framework stores


Posted by Derik Barseghian at December 19. 2008

In discussing Reporting, the Workflow Run Manager, Publication Ready Archives and other additions to Kepler we've identified some things we'd like space made for within provenance framework datastores.

I thought it would be good to list all these proposed additions in one place, and at the same time solicit feedback from others. This seems useful to do now, before "version1" provenance stores begin proliferating. At the same time we should be wary of bandwidth and store sizes, and not add anything extraneous.

So far I have:
* LSID - or similar universal id to unambiguously id a run
* workflow author
* workflow tags
* workflow description
* ROML (report object model language)
* RIO (report instance object)
* report (e.g. in pdf form)
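
To make the list concrete, here is a minimal sketch of a per-run record that would hold these additions. It is only an illustration: the class name, the field names and types, the `matches` helper, and the example id are all assumptions and do not come from the existing provenance schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RunRecord:
    """Hypothetical per-run record; every field name here is an assumption."""
    run_id: str                          # LSID or similar universal id for the run
    author: Optional[str] = None         # workflow author
    tags: List[str] = field(default_factory=list)  # workflow tags
    description: Optional[str] = None    # workflow description
    roml: Optional[str] = None           # ROML document, assumed to be stored as text
    rio: Optional[bytes] = None          # serialized report instance object (assumed opaque)
    report_pdf: Optional[bytes] = None   # rendered report, e.g. PDF bytes

    def matches(self, term: str) -> bool:
        """Case-insensitive substring search over author, description, and tags."""
        needle = term.lower()
        haystack = [self.author or "", self.description or "", *self.tags]
        return any(needle in text.lower() for text in haystack)


# Example with a made-up id in LSID form.
run = RunRecord(run_id="urn:lsid:example.org:run:42",
                author="Derik", tags=["ecology"])
```

Keeping the large items (ROML, RIO, report) optional reflects the store-size concern above: a run with no report would carry only the small text fields.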


More specific to my Workflow Run Manager development needs: if you think this mockup sorely lacks a column that should be searchable, please let me know.

The Workflow Run Manager mockup page also has, at the top, a diagram of what I believe is a fairly recent version of the Kepler provenance framework SQL schema. You can get an idea of what is stored by looking at this diagram.

 

Re: proposed additions to provenance framework stores

Posted by Shawn Bowers at January 08. 2009

I was wondering if there is any documentation on what the Report Object Model Language (ROML) is. I am also generally curious, w.r.t. reports, what type of information is envisioned to be included in a report. Is there a write-up or page that describes this? Thanks!

Re: proposed additions to provenance framework stores

Posted by Matthew Jones at January 09. 2009

Hey Shawn,

There isn't a page for the reporting part of the work, but there is a directory in SVN with all of the reporting design documents.  It's here:

https://code.kepler-project.org/code/kepler-docs/trunk/teams-and-wg/4-interest-groups/provenance/reporting/

The architecture overview is in the file "kepler_reporting_diagram.graffle".

For ROML in particular, you might look at the files "roml_layout.graffle", "schema/roml.xml", and "schema/riml.xml".

Matt

Re: proposed additions to provenance framework stores

Posted by Shawn Bowers at January 09. 2009

Hi Matt: 

Thanks for the reply. There seems to be a lot of detail in the docs in that directory. But, looking this over, I think I'm missing the "big picture". In particular, maybe you or Derik could give a two-sentence, high-level statement saying what a ROML report is, and what type of information a report would contain.

Thanks!


Re: proposed additions to provenance framework stores

Posted by Derik Barseghian at January 13. 2009

Hi Shawn,

A report contains author-specified items associated with a workflow run. A user wants to present and highlight certain things from a workflow run to others -- things like input parameters, all output from a specific actor, resultant graphs, and so forth. It's user-specified data pulled out of provenance and presented in a human-readable and printer-friendly way (headers and footers, page numbers, etc).

Derik

Re: proposed additions to provenance framework stores

Posted by Shawn Bowers at January 22. 2009

Thanks Derik. I think this is very interesting. One thing that you did not mention, which to me seems like an essential, if not the most important, aspect of tracking and reporting provenance information to a scientist is data lineage. In particular, data lineage refers to the data and processes that contributed to the creation of some output data product. Tracking and reporting lineage information is extremely useful for determining if a workflow result is "correct" or can be "trusted". For example, determining the quality of a result may depend on whether (from a scientist's perspective, e.g.) the input data used or the steps (algorithms/actors) involved in generating a data item are "appropriate/trusted". Another important use of data lineage information is to find those results that were derived from a particular data product, e.g., in cases where the input data was later found to contain errors, etc.

In many Kepler workflows, not all input data and not all actors are used in the derivation of an output -- and so determining data lineage requires more than reporting which actors were in a workflow, or what their input and output was. 
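
As a rough illustration of the point above, here is a minimal sketch of lineage tracing over dependency records. It assumes provenance has recorded, for each produced data item, which actor produced it and which item it consumed. The record layout, the function names, and the actor and item names are all hypothetical and are not taken from the Kepler provenance schema.

```python
from collections import defaultdict

# Hypothetical dependency records: (consumed_item, actor, produced_item).
DEPENDENCIES = [
    ("in_a", "Filter", "clean_a"),
    ("clean_a", "Model", "result_1"),
    ("in_b", "Plot", "result_2"),
]


def build_index(deps, reverse=False):
    """Map each item to its (actor, neighbor) pairs, upstream or downstream."""
    index = defaultdict(set)
    for consumed, actor, produced in deps:
        if reverse:
            index[consumed].add((actor, produced))   # follow data forward
        else:
            index[produced].add((actor, consumed))   # follow data backward
    return index


def trace(start, index):
    """Return (items, actors) reachable from start by following the index."""
    items, actors, stack = set(), set(), [start]
    while stack:
        current = stack.pop()
        for actor, neighbor in index.get(current, ()):
            actors.add(actor)
            if neighbor not in items:    # visit each item only once
                items.add(neighbor)
                stack.append(neighbor)
    return items, actors


def lineage(item, deps=DEPENDENCIES):
    """Data items and actors that contributed to producing item."""
    return trace(item, build_index(deps))


def derived_from(item, deps=DEPENDENCIES):
    """Data items (and the actors producing them) that depend on item."""
    return trace(item, build_index(deps, reverse=True))
```

In this toy run, the lineage of `result_1` involves only `in_a`, `clean_a`, and the `Filter` and `Model` actors, even though `in_b` and `Plot` were also part of the same workflow. `derived_from("in_a")` answers the reverse question: which results are affected if `in_a` is later found to contain errors.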

Another potential source of confusion is that "data lineage" is typically used synonymously with provenance. This is the case across a broad range of fields, including the archival/digital-library community, the database community, and the scientific workflow community. In these areas, "provenance" typically refers to the processing history associated with some object/result -- where the interest is typically in understanding the provenance of an object (i.e., provenance refers to the lineage of a data item). In both scientific workflows and in the database community, runtime information that is not directly linked to lineage is often referred to generally as "logging" information.

Within the scientific workflow community in particular, you might want to check out the first provenance challenge. This is now a bit old, but it describes a scientific workflow and a set of questions that are typically asked w.r.t. a run of the workflow (most of which are based on lineage information; a small number are based on general metadata of data items). Other approaches, such as the Open Provenance Model (OPM), which is being developed as a standard model for representing provenance information within scientific workflows, also place a primary emphasis on lineage.
