
Kepler/CORE funded, Sept 1, 2007

The Office of Cyberinfrastructure at the National Science Foundation has awarded $1.7M over three years to a team of researchers from UC Davis, UC Santa Barbara, and UC San Diego to develop Kepler/CORE, a Comprehensive, Open, Reliable, and Extensible Scientific Workflow Infrastructure.

In recent years, scientific workflow research and development has gained enormous momentum, driven by the needs of many scientific communities to more effectively manage and analyze their increasing amounts of data.

Whether scientists are piecing together our ancestors' tale through Assembling the Tree of Life (AToL, pPod, CIPRes), deciphering the workings of our biological machinery by chasing and identifying transcription factors (ChIP2), studying the effect of invasive species on biodiversity (SEEK), observing and modeling the atmosphere and oceans to simulate and understand effects of climate change on the environment (COMET, REAP), trying to understand and tame nuclear fusion through plasma edge simulations (CPES), or probing the nature and origins of the universe through observation of gravitational lensing or simulations of supernova explosions (Kepler-Astro), science has become increasingly data-driven, requiring considerable computational resources, access to diverse data, and integration of complex software tools. To address these challenges, these and many other projects have employed the Kepler scientific-workflow system.

"Scientific workflows are the scientists' way to get more eScience done by effectively harnessing cyberinfrastructure such as data grids and compute clusters from their desktops", says Bertram Ludaescher, Associate Professor at the Dept. of Computer Science and the Genome Center at UC Davis, and principal investigator of Kepler/CORE.

Scientific workflows start where script-based data-management solutions leave off. Like scripts, workflows can automate otherwise tedious and error-prone data-management and application-integration tasks. Unlike custom scripts, however, scientific workflows can be more easily shared, reused, and adapted to new domains. Many scientific-workflow systems also provide 'parallelism for free'. The Kepler system natively supports both assembly-line-like 'pipeline parallelism' and 'task parallelism', which enables multiple pipelines of tasks to operate concurrently. And unlike script writers, who must explicitly fork processes, manage queues, and worry about synchronizing multiple operations, Kepler users can let the workflow system schedule parallel tasks automatically. Other advantages over scripts include built-in support for tracking data lineage, or 'provenance', which allows scientists to better interpret their analysis results, re-run workflows with varying parameter settings and data bindings, or simply debug or confirm 'strange' results.
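To make the idea of pipeline parallelism concrete, the following is a minimal illustrative sketch (not Kepler code, and much simpler than Kepler's actor-oriented model): each processing stage runs in its own thread and stages are connected by queues, so items flow through the assembly line concurrently while the "engine" handles the threading and synchronization that a script writer would otherwise manage by hand. The function names (`stage`, `run_pipeline`) are hypothetical, invented for this sketch.

```python
import queue
import threading

SENTINEL = object()  # marks the end of the data stream


def stage(fn, inbox, outbox):
    """A pipeline stage: apply fn to each item, then pass the sentinel on."""
    for item in iter(inbox.get, SENTINEL):
        outbox.put(fn(item))
    outbox.put(SENTINEL)


def run_pipeline(source, *fns):
    """Connect one thread per stage with queues and stream items through."""
    first = queue.Queue()
    q = first
    threads = []
    for fn in fns:
        nxt = queue.Queue()
        t = threading.Thread(target=stage, args=(fn, q, nxt))
        t.start()
        threads.append(t)
        q = nxt
    # Feed the first stage; downstream stages start working immediately.
    for item in source:
        first.put(item)
    first.put(SENTINEL)
    results = list(iter(q.get, SENTINEL))
    for t in threads:
        t.join()
    return results


# While stage two processes item 1, stage one is already working on item 2.
print(run_pipeline(range(5), lambda x: x * 2, lambda x: x + 1))
# → [1, 3, 5, 7, 9]
```

In a real workflow system the user only wires components together graphically; the scheduling sketched above happens behind the scenes.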

"When we started Kepler a few years back as a grass-roots collaboration between the SEEK and SDM/SPA projects, we did not fully anticipate the broad interest scientific workflows would create", says co-PI Matt Jones, from the National Center for Ecological Analysis and Synthesis at UC Santa Barbara, adding, "The different groups in the Kepler community are pushing various extensions to the base system functionality, so it is now a perfect time to move Kepler from a research prototype to a reliable and easily extensible system."

Timothy McPhillips, co-PI at the UC Davis Genome Center, and chief software architect for Kepler/CORE adds, "To serve the target user communities, the system must be independently extensible by groups not directly collaborating with the team that develops and maintains the Kepler/CORE system. Facilitating extension in turn requires that the Kepler architecture be open and that the mechanisms and interfaces provided for developing extensions be well designed and clearly articulated."

Kepler/CORE development is informed and driven by various stakeholders, those projects and individuals who employ Kepler and wish to extend or otherwise improve the system for their specific needs. The inclusion of stakeholders in the steering of the overall collaboration aims at a more comprehensive and sustainable approach for future Kepler extensions.

"For Kepler to be seen as a viable starting point for developing workflow-oriented applications, and as middleware for developing user-oriented scientific applications, Kepler must be reliable both as a development platform and as a run-time environment for the user," says Ilkay Altintas, Kepler/CORE co-PI at the San Diego Supercomputer Center at UC San Diego.

While Kepler/CORE is primarily a software engineering project, many interesting computer science research problems are emerging from the application of scientific workflows: "As a computer scientist it is fascinating to see how real-world scientific-workflow problems--workflow design, analysis, and optimization for example--lend themselves to exciting research problems in computer science, spanning the areas of databases, distributed and parallel computing, and programming languages", says Ludaescher.

Shawn Bowers, co-PI and computer scientist at the UC Davis Genome Center, adds, "Scientific-workflow systems such as Kepler provide an opportunity to make scientific results more transparent and reproducible by capturing their provenance. By enhancing scientific workflows in this way, we can dramatically improve the usability of scientific results for scientists and the broader public."

For more information, see the Kepler Project and Kepler/CORE websites, or contact kepler-core at kepler-project dot org.
