Data-intensive workflows (a.k.a. scientific workflows) are routinely used in most scientific disciplines today, especially in the context of parallel and distributed computing. Workflows provide a systematic way of describing an analysis and rely on workflow management systems to execute complex analyses on a variety of distributed resources. They sit at the interface between end users and computing infrastructures. With the dramatic increase in raw data volume in every domain, workflows play an even more critical role in helping scientists organize and process their data and in leveraging HPC or HTC resources; for example, workflows played an important role in the discovery of gravitational waves.
This workshop focuses on the many facets of data-intensive workflow management systems, ranging from job execution to service management and the coordination of data, service, and job dependencies. The workshop therefore covers a broad range of issues in the scientific workflow lifecycle, including: representation and enactment of data-intensive workflows; design of workflow composition interfaces; workflow mapping techniques that optimize workflow execution; workflow enactment engines that must deal with failures in the application and execution environment; and a number of computer science problems related to scientific workflows, such as semantic technologies, compiler methods, and fault detection and tolerance.
Scientific computing will increasingly incorporate a number of different tasks that need to be managed alongside the main simulation or experimental tasks: ensemble analysis, data-driven science, artificial intelligence, machine learning, surrogate modeling, and graph analytics, all nontraditional applications unheard of in HPC just a few years ago. Many of these tasks will need to execute concurrently (that is, in situ) with simulations and experiments, sharing the same computing resources.
There are two primary, interdependent motivations for processing and managing data in situ. The first is the need to reduce data volume. In situ methods can make critical contributions to managing large data from computations and experiments by minimizing data movement, saving storage space, and boosting resource efficiency, often while simultaneously increasing scientific precision. The second is that in situ methods can enable scientific discovery from a broad range of data sources (HPC simulations, experiments, scientific instruments, and sensor networks) across a wide range of computing platforms: leadership-class HPC systems, clusters, clouds, workstations, and embedded devices at the edge.
The successful development of in situ data management capabilities can benefit real-time decision making, design optimization, and data-driven scientific discovery. This talk will feature six priority research directions that highlight the components and capabilities needed for in situ data management to succeed across a wide variety of applications: making in situ data management more pervasive, controllable, composable, and transparent; enabling greater coordination with the software stack; and developing a diversity of fundamentally new data algorithms.
Bio. Tom Peterka is a computer scientist at Argonne National Laboratory, a scientist at the University of Chicago Consortium for Advanced Science and Engineering (CASE), an adjunct assistant professor at the University of Illinois at Chicago, and a fellow of the Northwestern Argonne Institute for Science and Engineering (NAISE). His research interests are in large-scale parallel in situ analysis of scientific data. A recipient of the 2017 DOE Early Career Award and four best paper awards, Peterka has published over 100 peer-reviewed papers in conferences and journals including ACM/IEEE SC, IEEE IPDPS, IEEE VIS, IEEE TVCG, and ACM SIGGRAPH. Peterka received his Ph.D. in computer science from the University of Illinois at Chicago in 2007, and he currently leads several DOE- and NSF-funded projects.