Bertram Ludäscher is a professor at the Graduate School of Library and Information Science (GSLIS) at the University of Illinois at Urbana-Champaign and Director of the Center for Informatics Research in Science and Scholarship (CIRSS). He also holds affiliate faculty appointments at the National Center for Supercomputing Applications (NCSA) and in the Department of Computer Science at UIUC. Prior to joining Illinois, he was a professor in the Department of Computer Science and at the Genome Center at UC Davis, and until 2004 he was a research scientist at the San Diego Supercomputer Center (SDSC). His research interests include scientific data and workflow management and knowledge representation and reasoning. He is one of the founders of the Kepler scientific workflow system and a member of the DataONE leadership team, where he focuses on data and workflow provenance. He received his MS in computer science from the Technical University of Karlsruhe (now KIT) and his PhD from the University of Freiburg, both in Germany.
An often-touted advantage of scientific workflow systems is their ability to capture provenance information during execution. The idea is that a controlled environment such as a workflow system makes it easy to record relevant observables, e.g., data read and write events. The captured provenance can then be used to document data lineage, to debug faulty runs, to speed up re-runs of workflows by reusing unchanged parts, or, more generally, to support the reproducibility of computational science experiments. In this talk, I will first give an overview of different notions, forms, and research questions around data and workflow provenance. The database community, for example, has developed specialized notions such as why-, how-, where-, and why-not provenance. The scientific workflow community, on the other hand, has focused on forms of "black-box provenance," capturing, e.g., actor invocations and file I/O events to track possible data dependencies. Both communities share an interest in querying and analyzing provenance information. In the second part of the talk, I will take a critical look at the current use of provenance information from scientific workflows and scripts and argue that open, interoperable tools are needed that can combine different forms of available provenance, e.g., recorded or reconstructed retrospective provenance together with prospective provenance given by a workflow specification or by high-level user-defined annotations in scripts. To this end, I will describe YesWorkflow, a new project and toolkit under development that combines different forms of provenance information to let users answer questions about the data created and used during workflow runs and script executions. An important source of provenance in the YesWorkflow approach is a set of simple user annotations that represent the user's conceptual model of a workflow.
In this way, YesWorkflow can link low-level provenance observables with high-level questions users often need to answer to conduct their (computational) science. Thus, in addition to outward-facing “provenance for others”, YesWorkflow emphasizes the utility of provenance for the researchers’ own purposes.
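To make the annotation idea concrete, here is a minimal sketch of a script marked up with YesWorkflow-style comment annotations: `@begin`/`@end` delimit conceptual workflow blocks, and `@in`/`@out` declare the data flowing between them. The block names, data values, and file URIs below are invented for illustration; the annotations live entirely in comments, so the script itself runs unchanged.

```python
# Hypothetical script with YesWorkflow-style annotations. YesWorkflow
# reads only the comment tags to reconstruct the user's conceptual model
# (prospective provenance); the Python code executes normally.

# @begin clean_temperatures
# @in raw_readings @uri file:raw_temps.csv
# @out fahrenheit_readings @uri file:clean_temps.csv

def clean_temperatures(rows):
    # @begin drop_invalid
    # @in raw_readings
    # @out valid_rows
    # Keep only plausible Celsius readings; drop missing values.
    valid_rows = [r for r in rows if r is not None and -90 <= r <= 60]
    # @end drop_invalid

    # @begin convert_to_fahrenheit
    # @in valid_rows
    # @out fahrenheit_readings
    return [round(c * 9 / 5 + 32, 1) for c in valid_rows]
    # @end convert_to_fahrenheit

# @end clean_temperatures

print(clean_temperatures([20.0, None, 150.0, 0.0]))  # → [68.0, 32.0]
```

Given such markup, the YesWorkflow toolkit can render the annotated blocks as a dataflow graph and answer lineage questions (e.g., which inputs a given output depends on) without observing the run itself, which is what allows it to be combined with low-level, recorded provenance.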