SUBMITTED TO IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS

Using Provenance to Streamline Data Exploration through Visualization

Steven P. Callahan, Juliana Freire, Member, IEEE, Emanuele Santos, Carlos E. Scheidegger, Cláudio T. Silva, Member, IEEE, and Huy T. Vo

Abstract

Scientists are faced with increasingly large volumes of data. To analyze and validate various hypotheses, they need to create insightful visual representations of both observed data and simulated processes. Often, insight comes from comparing multiple visualizations. But data exploration through visualization requires scientists to assemble complex workflows, and today this process involves many error-prone and time-consuming tasks. We show that a new action-based model for capturing and maintaining detailed provenance of the visualization process can be used to streamline data exploration and reduce the time to insight. We describe an implementation of this infrastructure in the VisTrails system, where it enables the flexible re-use of pipelines, a scalable mechanism for creating a large number of visualizations, and the comparison of pipelines and their results. This provenance not only ensures reproducibility, but also allows scientists to easily navigate through the space of pipelines and parameter settings used in an exploration task.

Index Terms: Visualization, visualization systems, scientific exploration, provenance, history

I. INTRODUCTION

Visualization is a key enabling technology in the endeavor to comprehend the vast amounts of data being produced and acquired. To successfully analyze and validate various hypotheses, it is necessary to pose several queries, correlate disparate data, and create insightful visualizations of both the simulated processes and observed phenomena. However, data exploration through visualization often requires scientists to go through several steps.
Before users can view and analyze results, they need to assemble and execute complex pipelines by selecting data sets, specifying a series of operations to be performed on the data, and creating an appropriate visual representation [18], [35]. Often, insight comes from comparing the results of multiple visualizations created during the exploration process. Comparative visualization is essential when applying a given visualization process to multiple datasets generated in different simulations, when varying the values of certain visualization parameters, or when applying different variations of a given process (e.g., variations that use different visualization algorithms) to a dataset. Unfortunately, today this exploratory process contains many manual, error-prone, and time-consuming tasks. As a result, users spend much of their time managing the data and processes rather than on scientific investigation and discovery. As an example, consider the CORIE project that was established to study and model the spatial and temporal variability of the Lower Columbia River. CORIE scientists make heavy use of visualization to understand large volumes of data gathered from sensors spread along the river as well as derived from their simulation models. But the generation and maintenance of visualizations is a major bottleneck for the CORIE project. For example, to generate a new visualization that calibrates a simulation of the tidal cycle, a coordinated effort that includes several project members is needed.

S. Callahan, E. Santos, C. Scheidegger, C. Silva, and H. Vo are with the Scientific Computing and Imaging Institute, University of Utah. {stevec,emanuele,cscheid,csilva,hvo}@sci.utah.edu
J. Freire is with the School of Computing, University of Utah. juliana@cs.utah.edu
A modeling expert must run the simulation with the right combination of parameters; the sensor data must be collected and prepared by another staff member; and finally, a visualization expert constructs the visualizations. At the end of this process, captions and legends are often the only metadata available. This makes it hard, and sometimes impossible, to reproduce the visualization. For the thousands of visualizations that are published daily on the CORIE Web site, scripts are created that automate their generation. But finding and running the scripts are tasks that can only be performed by their creators. Even for their creators, it is hard to keep track of the correct versions of the scripts and the data they manipulate. Such ad-hoc processes are not suitable for the exploratory nature of data visualization, and they can hinder scientists' ability to understand the data and gain insights. In this paper, we propose a novel infrastructure whose goal is to simplify and streamline the process of

data exploration through visualization.

Fig. 1. An example of exploratory visualization for comparing isosurface extraction techniques using VisTrails. Complete provenance of the creation and exploration process is displayed in a history tree on the left. One pipeline that combines five different software libraries is shown in the middle. A parameter exploration of this pipeline for three different datasets and isosurface parameters is shown in the Visualization Spreadsheet on the right.

Visualization Systems: The State of the Art. Visualization systems such as VTK [27] and SCIRun [25] allow the interactive creation and manipulation of complex visualizations. These systems are based on the notion of dataflows [21], and they provide visual interfaces to produce visualizations by assembling pipelines out of modules (or functions) connected in a network (in this paper, we use the terms pipeline, dataflow, and workflow interchangeably). Although these systems allow the creation of complex visualizations, they have important limitations that hamper their ability to support the data exploration process at a large scale: they lack adequate support for the collaborative creation and exploration of a large number of visualizations. Because these systems do not distinguish between the definition of a dataflow and its instances, to execute a given dataflow with different parameters (e.g., different input files), users need to manually set these parameters through a GUI. Clearly, this process does not scale to more than a few visualizations. Additionally, modifications to parameters or to the definition of a dataflow are destructive: no change history is maintained. This places the burden on the user to first construct the visualization and then to remember the input data sets, parameter values, and the exact dataflow configuration that led to a particular image. Finally, they do not exploit optimization opportunities during pipeline execution.
For example, SCIRun not only materializes all intermediate results, but it may also repeatedly recompute the same results if so defined in the visualization pipeline.

Action-Based Provenance and Data Exploration. In this paper we describe a new action-based provenance model we designed for VisTrails. (Provenance refers to all information needed to reproduce a certain piece of data or assertion; it is also referred to as audit trail, lineage, or pedigree.) Unlike visualization systems [19], [25] and scientific workflow systems [1], [24], [29], [32], VisTrails captures detailed provenance of the exploratory process. The action-based model uniformly captures both changes to parameter values and changes to pipeline definitions by unobtrusively tracking all changes users make to dataflows in an exploration task. We refer to this detailed provenance of the pipeline evolution as a visualization trail, or a vistrail. The stored provenance ensures reproducibility of the visualizations, and it also allows scientists to easily navigate through the space of dataflows created for a given exploration task. The VisTrails interface gives them the ability to query, interact with, and understand the history of the visualization process. In particular, they can return to previous versions of a pipeline and change the specification or parameters to generate a new visualization without losing previous changes (see Figure 1). Another important feature of the action-based provenance model is that it enables a series of operations that greatly simplify the exploration process and have the potential

to reduce the time to insight. In particular, it allows the flexible re-use of pipelines, and it provides a scalable mechanism for creating and comparing a large number of visualizations as well as their corresponding pipelines.

Outline and Contributions. This paper provides the first detailed description of the VisTrails provenance and data exploration infrastructure. In Section II, we review related work. A brief overview of VisTrails is given in Section III. In Section IV, we present the action-based provenance mechanism, which captures both data and dataflow provenance. In Section V, we describe general mechanisms for enabling pipeline re-use, computing pipeline differences, and automatically generating a large number of visualizations through parameter-space explorations. To evaluate the effectiveness of VisTrails in different settings and application domains, we performed an expert review, which is discussed in Section VI.

II. RELATED WORK

The first implementation of VisTrails [2] only tracked the provenance of visualizations: for a given visualization, it stored the steps and parameters that led to the visualization. In more recent work [6], we introduced the notion of action-based provenance and showed how it captures the evolution of dataflows. A brief introduction to VisTrails as a tool for exploratory visualization is given in other recent papers [7], [11]. In this paper, we expand on these introductions by giving a detailed description of the VisTrails data exploration infrastructure, and by showing how action-based provenance can be used to implement features that greatly simplify and streamline the scientific discovery process. We also describe new user interfaces we designed that allow users to interact with and leverage the provenance information.

Visualization Systems.
Several systems are available for creating and executing visualization pipelines, such as AVS [33], OpenDX [15], SCIRun [25], ParaView [19], and VTK [27]. Most of these systems use a dataflow model, where a visualization is produced by assembling visualization pipelines out of basic modules. They typically provide easy-to-use, visual interfaces to create the pipelines. However, as discussed above, these systems lack the infrastructure to properly manage a large number of pipelines, and they often apply naïve execution models that do not exploit optimization opportunities. The solutions we propose in VisTrails can be easily integrated with these systems.

Comparative Visualization. The use of spreadsheets for displaying multiple images has been proposed to enable comparative visualization. Levoy's Spreadsheet for Images [22] is an alternative to the flow-chart-style layout employed by many earlier systems that use the dataflow model. It devotes its screen real estate to viewing data by using a tabular layout and hiding the specification of operations in interactively programmable cell formulas. The 2D nature of the spreadsheet encourages the application of multiple operations to multiple data sets through row- or column-based operations. Chi et al. [9] apply the spreadsheet paradigm to information visualization in their Spreadsheet for Information Visualization. Linking between cells is done at multiple levels, ranging from object interactions at the geometric level to arithmetic operations at the pixel level. The difficulty with both of these spreadsheets is that they fail to capture the provenance of the exploration process, since the spreadsheet only represents the latest state of the system. In VisTrails, we support concurrent exploration of multiple visualizations with the use of a spreadsheet. The interface is similar to the one proposed by Jankun-Kelly and Ma [16], and it provides a natural way to explore a multi-dimensional parameter space (Section V).
Data Provenance and Workflow Evolution. The issue of provenance in the context of scientific workflows has received substantial attention recently. Most works, however, focus on data provenance, i.e., maintaining information about how a given data product was generated [28]. This information has many uses, from purely informational to enabling the regeneration of the data, possibly with different parameters. However, while solving a particular problem, scientists often create several variations of a workflow in a trial-and-error process. These workflows may differ both in the parameter values used and in their specifications. If only the provenance of individual results is maintained, useful information about the relationships among the workflows is lost. In addition, since much expert knowledge is involved in the exploratory process, the change history of both data (e.g., input parameters) and processes contains important knowledge that can potentially be extracted and re-used to solve an array of problems. To the best of our knowledge, VisTrails is the first system to provide support for tracking workflow evolution. In their pioneering work on the GRASPARC system, Brodlie et al. [4] proposed the use of a history mechanism that allowed scientists to steer an ongoing simulation by backtracking a few steps, changing parameters, and resuming execution. However, their focus was on steering time-dependent simulations, not on data exploration. Derthick and Roth [10] introduced a tree-like structure for branching history for pre-defined

operations that consist of queries over a database. This work did not provide a formal model for actions, nor did it address the notion of a general workflow as in VisTrails. Kreuseler et al. [20] proposed a history mechanism for exploratory data mining. They use a tree structure, similar to a vistrail, to represent the change history, and describe how undo and redo operations can be calculated in this tree structure. They describe a theoretical framework that attempts to capture the complete state of a software system. In contrast, VisTrails only tracks the evolution of the workflows, which allows for the much simpler action-based provenance mechanism described in Section IV.

The issue of provenance has also been addressed in systems that provide data management infrastructure for scientific applications [12], [23]. Frew and Bose [12] described a mechanism that logs information about workflow executions in the Earth System Science Workbench (ESSW). The logged information allows them, for example, to infer linkages between science objects. Such information can also be obtained from a vistrail. It is not clear, however, whether it is possible to infer from the ESSW logs how a particular workflow evolved over time. GridDB [23] is a data-centric overlay for scientific grid data analysis. By modeling the input parameters and outputs of workflow modules as relations (the relational cover of a workflow), it provides a declarative interface for steering grid computations and for exploring parameter sets. However, GridDB maintains only data provenance: a given visualization can be traced back to entities of the relational cover (the subset of the data represented with relations). Similarly, other systems have recently been introduced that track the parameters changed during the user's exploration process. Groth and Streefkerk [14] capture viewing manipulations of a visualization and allow annotations on these explorations. Tory et al. [31] provide a parallel-coordinates-style interface for exploring the parameter space; they track the parameters changed and display them as a history bar of the resulting images. Finally, Jankun-Kelly and Ma [17] provide a foundation for tracking visualization parameters with the goal of encapsulating the exploration process for visualization. However, unlike VisTrails, these systems do not capture workflow evolution.

III. VISTRAILS: OVERVIEW

Our motivation for developing VisTrails came from the realization that visualization systems lack adequate support for large-scale data exploration. VisTrails is a visualization management system that manages both the process and the metadata associated with visualizations. With VisTrails, we aim to give scientists a dramatically improved and simplified process to analyze and visualize large ensembles of simulations and observed phenomena.

Fig. 2. VisTrails Architecture.

A. System Architecture

The high-level architecture of the system is shown in Figure 2. Users create and edit dataflows using the Vistrail Builder user interface. The dataflow specifications are saved in the Vistrail Repository. Users may also interact with saved dataflows by invoking them through the Vistrail Server (e.g., through a Web-based interface) or by importing them into the Visualization Spreadsheet. Each cell in the spreadsheet represents a view that corresponds to a dataflow instance; users can modify the parameters of a dataflow as well as synchronize parameters across different cells. Dataflow execution is controlled by the Vistrail Cache Manager, which keeps track of invoked operations and their respective parameters.
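This caching behavior can be sketched as follows. The sketch below is a minimal illustration, not the actual VisTrails implementation: the `Module`, `signature`, and `CacheManager` structures are invented, and the key idea shown is only that a module's cache key covers its operation, its parameters, and its entire upstream subpipeline, so that structurally identical subpipelines are executed once.

```python
import hashlib

class Module:
    """A dataflow module: an operation, its parameters, and upstream inputs."""
    def __init__(self, op, params, upstream=()):
        self.op, self.params, self.upstream = op, dict(params), list(upstream)

    def signature(self):
        """Hash the operation, parameters, and upstream subpipeline so that
        structurally identical subpipelines share one cache entry."""
        h = hashlib.sha256()
        h.update(self.op.encode())
        for k in sorted(self.params):
            h.update(f"{k}={self.params[k]}".encode())
        for u in self.upstream:
            h.update(u.signature().encode())
        return h.hexdigest()

class CacheManager:
    """Only signatures not seen before are sent to the execution engine."""
    def __init__(self, engine):
        self.engine, self.cache = engine, {}

    def run(self, module):
        sig = module.signature()
        if sig not in self.cache:
            inputs = [self.run(u) for u in module.upstream]
            self.cache[sig] = self.engine(module.op, module.params, inputs)
        return self.cache[sig]
```

With this scheme, running two pipelines that share a reader module executes the reader only once; the second pipeline's request for it is satisfied from the cache.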
Only new combinations of operations and parameters are requested from the Vistrail Dataflow Execution Engine, which executes the operations by invoking the appropriate modules from the underlying libraries (e.g., VTK, ITK, matplotlib). The execution engine also interacts with the Optimizer module, which analyzes and optimizes the dataflow specifications. A log of dataflow executions is kept in the Vistrail Log. The different components of the system are briefly described below. A more detailed description of an earlier version of our system is given in previous work [2].

Dataflow Specifications. A key feature that distinguishes VisTrails from previous visualization systems is that it separates the notion of a dataflow specification from its instances. A dataflow instance consists of a sequence of operations used to generate a visualization. This information serves both as a log of the steps followed to generate a visualization (a record of the visualization provenance) and as a recipe to automatically regenerate the visualization at a later time. The steps can be replayed exactly as they were first executed, and they can also be used as templates, since they can be parameterized. For example, the visualization spreadsheet in Figure 3(b)

illustrates a multi-view visualization of a single dataflow specification where the file name parameter is varied in several of the spreadsheet cells.

Fig. 3. The Vistrail Builder (a) and Visualization Spreadsheet (b) showing the dataflows that derive visualizations of fresh-water discharge from the Columbia River for different time steps retrieved from the Web.

Operations in a vistrail dataflow include visualization operations (e.g., VTK calls); application-specific steps (e.g., invoking a simulation script); and general file manipulation modules (e.g., transferring files between servers). To handle the variability in the structure of different kinds of operations, and to easily support the addition of new operations, we defined a flexible XML schema [5] to represent the dataflows. The schema captures all information required to re-execute a given dataflow: it stores information about the individual modules in the dataflow (e.g., the function executed by the module, input and output parameters) and their connections, i.e., how the outputs of a given module are connected to the input ports of another module. The XML representation for vistrail dataflows allows the reuse of standard XML tools and technologies. An important benefit of using an open, self-describing specification is the ability to share (and publish) dataflows. This allows a scientist to publish an image along with its associated dataflow so that others can easily reproduce the results. Another benefit of using XML is that the dataflow specification can be queried using standard XML query languages such as XPath [36] and XQuery [3]. For example, an XQuery query could be used in the context of the CORIE project to locate a dataflow that provides a 3D visualization of the salinity at the Columbia River estuary (as in Figure 3) in a database of published dataflows.
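As a rough sketch of this kind of query, the snippet below searches a hypothetical XML serialization of a dataflow with an XPath-style expression. The element and attribute names (`dataflow`, `module`, `function`, `connection`) are invented for illustration and do not reflect the actual VisTrails schema; Python's standard `xml.etree.ElementTree` supports only a small XPath subset, whereas a real deployment could use full XPath or XQuery.

```python
import xml.etree.ElementTree as ET

# A hypothetical dataflow serialization; element and attribute names
# are illustrative, not the actual VisTrails schema.
doc = """
<dataflow id="salinity-3d">
  <module id="1" function="FileReader"/>
  <module id="2" function="vtkContourFilter"/>
  <connection source="1" sourcePort="output" target="2" targetPort="input"/>
</dataflow>
"""

root = ET.fromstring(doc)
# XPath-subset query: find the modules that execute a particular function.
hits = root.findall(".//module[@function='vtkContourFilter']")
print([m.get("id") for m in hits])  # ['2']
```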
Once the dataflow is found, the scientist could then apply the same dataflow to more current simulation results, or modify the dataflow to test an alternative hypothesis.

Caching, Analysis and Optimization. Having a high-level specification allows the system to analyze and optimize dataflows. Executing a dataflow can take a long time, especially if large data sets and complex visualization operations are used. It is thus important to be able to analyze the specification and identify optimization opportunities. Possible optimizations include, for example, factoring out common subexpressions that produce the same value; removing no-ops; identifying steps that can be executed in parallel; and identifying intermediate results that should be cached to minimize execution time. Although most of these optimization techniques are widely used in other areas, they have yet to be applied in dataflow-based visualization systems. In our current VisTrails prototype, we have implemented memoization (caching). VisTrails leverages the dataflow specification to identify and avoid redundant operations. The algorithms and implementation of the Vistrail Cache Manager (VCM) are described in previous work [2]. Caching is especially useful while exploring multiple visualizations: when variations of the same dataflow need to be executed, substantial speedups can be obtained by caching the results of overlapping subsequences of the dataflows.

Running a Dataflow. The Vistrail Dataflow Execution Engine receives as input a dataflow instance and executes it by invoking the appropriate libraries. Currently,

VisTrails supports VTK, ITK, HTTP, SciPy, ImageMagick, Web services, and Matplotlib, among others. New packages are easily ported and are frequently added. VisTrails provides a plug-in mechanism that greatly simplifies the addition of new dataflow modules. To add a new method, a Python [34] wrapper needs to be created that defines the input and output parameters, as well as specifies how to invoke the method. Details on the plug-in wrapping process are available on the VisTrails project Web site.

Fig. 4. A snapshot of the VisTrails history management interface. Each node in the vistrail history tree represents a dataflow version. An edge between a parent and a child node represents the set of actions applied to the parent to obtain the dataflow for the child node.

The Execution Engine is unaware of caching. To accommodate caching in the execution engine, we use a class that behaves essentially like a proxy: to the rest of the dataflow it looks like a filter, but instead of performing computations, it simply looks up the result in the cache. The VCM is responsible for replacing a subnetwork that has been previously executed with an appropriate instance of the caching class. After the (partial) dataflow is executed, its outputs are stored in new cache entries.

Information pertinent to the execution of a particular dataflow instance is kept in the Vistrail Log (see Figure 2). There are many benefits to keeping this information. One is the ability to debug the application; e.g., it is possible to check the results of a dataflow using simulation data against sensor data. Another important benefit is that it reduces the cost of failures: if a visualization process fails, it can be restarted from the failure point. The latter is especially useful for long-running processes, as it may be very expensive and time-consuming to execute the whole process from scratch.
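The plug-in wrapper described above, which declares a module's input and output ports and says how to invoke the wrapped method, might look roughly as follows. The `Module` base class, the port declarations, and the `GaussianSmooth` example are all hypothetical; the actual VisTrails plug-in API is documented on the project Web site and differs from this sketch.

```python
# A minimal sketch of a plug-in wrapper: declare input/output ports and
# specify how to invoke the wrapped method. The Module/port API below is
# hypothetical, not the actual VisTrails one.

class Module:
    input_ports = {}
    output_ports = {}

    def compute(self, **inputs):
        raise NotImplementedError

class GaussianSmooth(Module):
    """Wraps a hypothetical smoothing routine from an imaging library."""
    input_ports = {"image": list, "sigma": float}
    output_ports = {"smoothed": list}

    def compute(self, image, sigma):
        # Stand-in for the real library call (e.g., an ITK smoothing filter);
        # here we just scale the samples so the sketch is runnable.
        kernel_weight = 1.0 / (1.0 + sigma)
        return {"smoothed": [v * kernel_weight for v in image]}
```

Because the port declarations are plain data, the system can introspect them to draw the module in the builder and to type-check connections, which is in the spirit of the data structure shared by the plug-in wrappers and the Vistrail Builder.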
Logging all the information associated with all dataflows may not be feasible, so VisTrails provides an interface that lets users select which information, and how much of it, should be saved.

Creating and Interacting with Vistrails. The Vistrail Builder (VB) allows users to create and edit dataflows (see Figure 3(a)). It reads and writes dataflows in the same XML format as the other components of the system, and it shares the familiar nodes-and-connections paradigm with other dataflow systems. To automatically generate the visual representation of the modules, it reads the same data structure used by the plug-in wrappers. Like the Execution Engine, the VB requires no change to support additional modules.

The VisTrails Visualization Spreadsheet allows users to compare the results of multiple dataflows. It consists of a set of separate visualization windows arranged in a tabular view. This layout makes efficient use of screen space, and the row/column groupings can conceptually help the user explore the visualization parameter space [8], [9] (see Section V-C). The cells in a spreadsheet may execute different dataflows, and they may also use different parameters for the same dataflow specification (see Figure 3). To ensure efficient execution, all cells share the same cache. Note that cells in a spreadsheet can be synchronized in many different ways. For example, in Figure 7, cells are synchronized with respect to the camera viewpoint: if one cell is rotated, the same rotation is applied to the other synchronized cells. This figure also illustrates the usefulness of the spreadsheet for exploring the parameter space of an application. In this case, it allows different visualizations of MRI data of a head to be compared: different Alpha parameters of an image filter are used along the horizontal axis, and different Beta parameters along the vertical axis.
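The Alpha-by-Beta layout just described amounts to taking a cross product of parameter values and binding each combination to one spreadsheet cell. A minimal sketch, with invented parameter names and values:

```python
from itertools import product

alphas = [0.2, 0.5, 0.8]  # varied along the horizontal axis (illustrative values)
betas = [1, 2]            # varied along the vertical axis (illustrative values)

# Each spreadsheet cell is the same dataflow specification with one
# (Alpha, Beta) combination bound to the image filter's parameters.
cells = [
    {"row": r, "col": c, "params": {"Alpha": a, "Beta": b}}
    for (r, b), (c, a) in product(enumerate(betas), enumerate(alphas))
]
print(len(cells))  # 6 cells: a 2x3 grid
```

Generating the cell configurations this way scales to any number of parameter dimensions: an n-dimensional slice of the parameter space is just an n-way cross product.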

IV. ACTION-BASED PROVENANCE

VisTrails uses an action-based model to capture provenance. As the scientist makes modifications to a particular dataflow, the provenance mechanism records those changes. Instead of storing a set of related dataflows, we store the operations, or actions, that are applied to the dataflows (e.g., the addition of a module, the modification of a parameter, etc.). This representation is both simple and compact: it uses substantially less space than the alternative of storing multiple versions of a dataflow. In addition, it enables the construction of an intuitive interface that allows scientists to both understand and interact with the history of the dataflow through these changes, as illustrated in Figure 4. A tree-based view allows a scientist to return to a previous version in an intuitive way, to undo bad changes, to compare different dataflows, and to be reminded of the actions that led to a particular result. This, combined with a caching strategy that eliminates redundant computations, allows the scientist to efficiently explore a large number of related visualizations. VisTrails is the first system to provide a mechanism that uniformly captures provenance information for both the data and the processes that derive the data. Along with the benefits of previously proposed provenance mechanisms, such as reproducibility and documentation, the action-based model has a number of additional benefits in the context of data exploration. As we discuss in Section V, it enables operations that streamline the data exploration process.

Vistrail: An Evolving Dataflow. A vistrail VT is a rooted tree in which each node corresponds to a version of a dataflow, and each edge between nodes d_p and d_c, where d_p is the parent of d_c, corresponds to the action applied to d_p that generated d_c. This is similar to the versioning mechanism used in Darcs [26].
The vistrail in Figure 4 shows a set of changes to a dataflow that was used to generate some CORIE visualizations similar to those shown in Figure 3; in this case, a dataflow to visualize the salinity in a small section of the estuary. This dataflow (tagged "With Text") was used to create four different dataflows that represent different time steps of the data.

More formally, let DF be the domain of all possible dataflow instances, where ∅ ∈ DF is a special empty dataflow. Also, let x : DF → DF be a function that transforms a dataflow instance into another, and let D be the set of all such functions. A vistrail node corresponding to a dataflow d is constructed by composing a sequence of actions, where each x_i ∈ D:

    d = x_n(x_{n-1}(... (x_1(∅)) ...))

In what follows, we use the following notation: we represent dataflows as d_p, and if a dataflow d_c is created by applying a sequence of actions to d_p, we say that d_p < d_c (i.e., the vistrail node d_c is a descendant of d_p). In addition, ∅ ∈ VT, and for every d_n ∈ VT, if d_n ≠ ∅, then ∅ < d_n.

    type Vistrail = [ Action*, annotation? ]
    type Action   = [ tag?, annotation?,
                      (addmodule | deletemodule | replacemodule |
                       addconnection | deleteconnection |
                       setparameter | changeparameter) ]

Fig. 5. Excerpt of the vistrail schema.

An excerpt of the vistrail XML schema is shown in Figure 5. For simplicity of presentation, we only show a subset of the schema and use a notation less verbose than XML Schema. A vistrail has a unique ID, a name, an optional annotation, and a set of actions. Each action is uniquely identified by an ID (@id), the ID of its parent (@parentid), its creator (@user), and the time of creation (@date). The different actions we have implemented in our current prototype include: adding, deleting, and replacing dataflow modules; adding and deleting connections; and setting parameter values.
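The formal model above, in which a version is the composition x_n(... x_1(∅) ...), suggests a direct implementation: store only the actions (each pointing to its parent), and materialize a dataflow by replaying the chain of actions from the root. The sketch below is illustrative, with a simplified action vocabulary and an invented dataflow representation; it is not the VisTrails code.

```python
class Action:
    """One change action; parent_id links actions into the vistrail tree."""
    def __init__(self, action_id, parent_id, kind, **args):
        self.id, self.parent = action_id, parent_id
        self.kind, self.args = kind, args

    def apply(self, dataflow):
        # Shallow copy is enough here: each materialize() replays from scratch.
        df = {"modules": dict(dataflow["modules"])}
        if self.kind == "addModule":
            df["modules"][self.args["name"]] = dict(self.args["params"])
        elif self.kind == "deleteModule":
            del df["modules"][self.args["name"]]
        elif self.kind == "setParameter":
            df["modules"][self.args["name"]][self.args["param"]] = self.args["value"]
        return df

def materialize(actions, version_id):
    """Replay the chain of actions from the empty dataflow to a version."""
    by_id = {a.id: a for a in actions}
    chain, node = [], version_id
    while node is not None:
        chain.append(by_id[node])
        node = by_id[node].parent
    df = {"modules": {}}
    for action in reversed(chain):  # root-to-leaf order
        df = action.apply(df)
    return df
```

Because sibling versions share their common prefix of actions, branching the tree costs only the new actions; undo is simply materializing the parent version.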
To simplify the retrieval of particularly interesting versions, a node may have a name (the optional attribute tag in the schema) as well as additional notes from the creator (annotations). Note that the action-based provenance provides a very concise representation of the exploratory process. Figure 1 shows an example of the history captured during a single user's exploration of isosurface extraction techniques over several days. The user constructed several visualizations and tagged only the more important milestones in the exploration process. Although the complete vistrail contains 900 actions, it requires only 25 KB of disk space when compressed, far less space than the actual dataset being visualized, or even than the dataflow specifications of the tagged versions would require if saved separately.

V. DATA EXPLORATION THROUGH PIPELINE MANIPULATIONS

Capturing provenance by recording the changes applied to dataflows has benefits in both uniformity and compactness of representation. In addition, it allows powerful data exploration operations through direct manipulation of the history tree. Due to space limitations, we only describe a few of these operations. First, we describe our interface for interacting with provenance to gain a better understanding of the exploration process than a single visualization specification provides. Second, we show that

stored actions lend themselves very naturally to reuse through an intuitive macro mechanism. Third, we describe a bulk-update mechanism that allows the creation of a large number of visualizations covering an n-dimensional slice of the parameter space of a dataflow. Finally, we describe how the differences between dataflows can be efficiently computed in the action-based model.

A. Interacting with Provenance

Figures 1, 3(a), and 4 show different history trees. Each node in a history tree corresponds to one instance of a pipeline. An edge between two nodes represents an action that occurred between pipelines. To avoid visual clutter, in addition to a full tree view, we provide a default condensed view of the tree. The condensed view shows only the nodes that the user has deemed significant and assigned a tag. Since a tag is only metadata, it can easily be changed or even removed, thus changing the representation of the tree. A series of actions between tagged nodes is represented by three perpendicular lines that can be expanded to the full representation with a double click. As described above, additional metadata (e.g., annotations, creator, and date of creation) is recorded with each action and is displayed with the history tree. Annotations can be provided by the user to facilitate searching, clarify the exploration process, and provide the additional information that may be needed for a complete audit trail. When a node is selected, the metadata is displayed in the right panel of the history tree viewer, as shown in Figure 1 (left). In addition, the creator and date information are incorporated as visual cues in the history tree. The color of a node denotes its creator (blue nodes were created by the user and orange nodes were created by someone else), and the saturation of a node denotes chronology (more recent nodes are more saturated).
These cues, along with the tags and annotations, allow a user to open a new vistrail and quickly understand the visualization process. Inevitably, as a user creates and explores visualizations, mistakes will be made or the tree will grow too large. Thus, we provide pruning operations that hide extraneous nodes (no provenance information is actually deleted). Pruning the tree can currently be performed in two ways. The first, and simplest, pruning operation is to manually select the nodes to be removed and press the delete key. The second pruning operation is performed through a search and refine interface. A search highlights only the nodes that match a query, whereas a refine collapses the history tree to contain only the matching nodes. Both search and refine features use a keyword-based text input provided in the interface. Queries over modules, parameters, tags, annotations, users, and dates can be performed by directly typing words or by using a simple query interface. For example, a version created by stevec in March that uses a vtkRenderer could be queried as user:stevec date:mar vtkRenderer. These features are especially useful in a collaborative setting.

Fig. 6. A vistrail macro consists of a sequence of change actions that represent a fragment of a dataflow. Here we show an example of a series of actions performed on one version that can be applied to other, similar versions using a macro.

B. Reusing Stored Provenance

Visualization pipelines are complex and require deep knowledge of visualization techniques and libraries (e.g., VTK [27]). Even for experts, creating large pipelines is time-consuming. As complex visualization pipelines contain many common tasks, mechanisms that allow the re-use of pipelines or pipeline fragments are key to streamlining the visualization process.
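To make the action-based representation concrete, the following is a minimal sketch (all class and function names are hypothetical; the actual VisTrails schema differs) of the model these operations build on: a version of a pipeline is materialized by composing the change actions on its root-to-node path.

```python
from dataclasses import dataclass, field

# Sketch only: a pipeline is a set of modules and parameter settings,
# and each change action is a function from pipelines to pipelines.

@dataclass
class Pipeline:
    modules: dict = field(default_factory=dict)      # module id -> module name
    parameters: dict = field(default_factory=dict)   # (module id, name) -> value

def add_module(mod_id, name):
    def action(p):
        p.modules[mod_id] = name
        return p
    return action

def set_parameter(mod_id, pname, value):
    def action(p):
        p.parameters[(mod_id, pname)] = value
        return p
    return action

def materialize(actions):
    """Compose a root-to-node list of actions, starting from the empty pipeline."""
    p = Pipeline()
    for a in actions:
        p = a(p)
    return p

# A version defined by three actions on the path from the root:
version = [add_module(0, "vtkDataSetReader"),
           set_parameter(0, "FileName", "head.vtk"),
           add_module(1, "vtkRenderer")]
pipeline = materialize(version)
```

Because only the actions are stored, every operation below (macros, bulk updates, differences) is a manipulation of such action lists rather than of dataflow graphs.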
Figure 6 shows a simplified example of a vistrail that describes the steps for a pipeline that reads a dataset, applies a colormap, creates a scalar bar, and finally adds some text before rendering the images. Since the last three steps are needed for each time step, they can be made into a macro and re-used in all these pipelines. The action-based model leads to a natural representation for a macro: a macro is just a sequence of actions

x_j ∘ x_{j-1} ∘ ... ∘ x_i

Inserting a macro into a vistrail node d_i is also conceptually simple. The sequence of actions in the macro is composed with the actions in node d_i:

(x_j ∘ x_{j-1} ∘ ... ∘ x_i) ∘ d_i

There are several possible ways of defining a macro. The simplest is to do so directly on the history tree, by specifying a pair of versions d_j and d_i, where d_i < d_j,

and define the macro as the sequence of actions that takes d_i into d_j.

Fig. 7. Visualization Spreadsheet. Parameters for a vistrail loaded in a spreadsheet cell can be interactively modified by clicking on the cell. Cameras, as well as other vistrail parameters, can be synchronized across cells. This spreadsheet contains visualizations of MRI data of a head and was generated procedurally with the parameter exploration interface shown on the right. The horizontal axis varies one parameter of an image filter and the vertical axis explores another. The interface allows the user to manage the layout of cells when multiple visualizations are produced by one pipeline, as shown above with the two views (labeled 1 and 2).

The more common way of defining a macro is by interactively selecting a set of modules and connections in a given pipeline, so that these modules can be re-created in a different pipeline (i.e., copy and paste). Although a macro corresponds to a set of modules and connections in a dataflow, it is represented as a set of actions. Since dataflows are not stored directly, it is necessary to identify the actions in the vistrail that create and modify the selected modules and connections. However, between the first action x_i and the last action x_j, there might be actions that change parts of the pipeline that were not selected. These potentially irrelevant actions must be removed. In the current implementation, we perform a simple analysis and remove actions that are applied to modules that are not created between x_i and x_j. More complex, application-dependent analyses are possible. When a user defines a macro, some of the actions may create new dataflow modules and connections. Even though some of the connections will be between modules created inside the macro, some may connect to previously existing modules.
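The simple filtering analysis described above can be sketched as follows (the action encoding is hypothetical and connection actions are elided for brevity): actions are kept only if their target module is created inside the selected range.

```python
# Sketch of macro extraction: between the first and last selected actions
# there may be actions touching modules that were not selected; the simple
# analysis drops actions applied to modules not created inside the macro.

def extract_macro(actions, selected_modules):
    """Keep only actions whose target module is created inside the macro."""
    created = set()
    macro = []
    for kind, mod_id, payload in actions:
        if kind == "add_module":
            if mod_id in selected_modules:
                created.add(mod_id)
                macro.append((kind, mod_id, payload))
        elif mod_id in created:   # e.g. set_parameter on a macro-owned module
            macro.append((kind, mod_id, payload))
    return macro

actions = [("add_module", 0, "vtkColorTransferFunction"),
           ("add_module", 1, "vtkRenderer"),                 # not selected
           ("set_parameter", 1, ("Background", "white")),    # filtered out
           ("set_parameter", 0, ("ColorSpace", "HSV"))]
macro = extract_macro(actions, selected_modules={0})
```

In this sketch, applying the macro is simply appending its actions to another version's action list; a connection action with one endpoint outside `created` would be a dangling connection that must be re-matched when the macro is applied.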
Thus, mechanisms are needed to identify modules that match these dangling connections in a dataflow. To aid in the identification of matching modules when a macro is applied, we store, along with the macro, the external modules of the original dataflow from which the macro was extracted. We are currently investigating alternative strategies for automatically locating candidate matches to better guide the user in selecting the appropriate modules.

C. Scalable Derivation of Visualizations

The action-oriented model also leads to a very natural means to script dataflows. The parameter space of a dataflow d, denoted by P(d), is the set of dataflows that can be generated by changing the values of parameters of d. From the actions x_k, ..., x_1 that define d, we can derive P(d) by tracking addModule and deleteModule actions, and by knowing P(m), the parameter space of module m, for each module in the dataflow. Each parameter can then be thought of as a basis vector for the parameter space of d. Using the action-based model, instances for the basis vectors can be produced by a sequence of setParameter actions. Dataflows spanning an n-dimensional subspace of P(d) are generated as follows:

setParameter(id_n, value_n) ∘ ... ∘ setParameter(id_1, value_1) ∘ d

The ability to apply bulk updates over a set of distinct dataflows greatly simplifies the exploration of the parameter space for a given task and provides an effective means to create a large number of visualizations. Our bulk-update interface is accessed through a tab in the VisTrail Builder (see Figure 7). The parameters that have been set throughout the pipeline are displayed in the right column. All the parameters the user desires to explore are dragged into the main window. Our interface allows the user to map a parameter to one of four dimensions in the spreadsheet: horizontal rows, vertical columns, multiple sheets, and animations through time in a single cell. Additionally, there are several options for defining the values to explore. The

first is a simple interpolation between a start and an end value, most commonly used for simple integer or floating-point values. The second is a user-specified list that can be used for any data type. This option is intended for exploring strings such as file names or labels. Finally, the most flexible option provides an interface for creating a Python function that defines the values. This option is intended for providing non-linear sequences of values for exploration.

Mapping bulk updates to a spreadsheet is relatively straightforward when the pipeline produces a single visualization. However, this is not always the case, as shown by the different slice axes used in Figure 7. We provide an interface that allows the user to specify the spreadsheet layout of multiple visualizations during bulk updates. An annotated snapshot of the pipeline is displayed that shows a number on each module that appears multiple times in the pipeline. For example, in the figure, several modules are repeated twice and are thus assigned a 1 or a 2 to distinguish them in the explorations. The layout of the outputs is specified using a virtual cell interface that allows annotated cells to be stacked horizontally and vertically (shown on the bottom right of the figure). This gives the user the ability to manage multiple visualizations efficiently. Note that since parameter value modifications and dataflow modifications are captured uniformly by the action-based provenance, a similar mechanism can be used to explore spaces of different dataflow definitions.
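The bulk-update mechanism can be sketched as a cross product of setParameter actions composed with the base dataflow (hypothetical encoding; the first two value-generation options are shown, and the third would simply call a user-supplied Python function):

```python
import itertools

def interpolated(start, end, steps):
    """Option 1: linear interpolation between a start and an end value."""
    return [start + (end - start) * i / (steps - 1) for i in range(steps)]

def explore(base_actions, explored):
    """explored: list of (parameter id, values); yields one dataflow variant
    per point of the n-dimensional slice of the parameter space."""
    dims = [[("set_parameter", pid, v) for v in values]
            for pid, values in explored]
    for combo in itertools.product(*dims):
        yield base_actions + list(combo)

base = [("add_module", 0, "vtkImageGaussianSmooth")]
variants = list(explore(base, [
    ("StandardDeviation", interpolated(1.0, 2.0, 3)),  # option 1: interpolation
    ("Dimensionality", [2, 3]),                        # option 2: explicit list
]))
# 3 x 2 = 6 dataflows, one per spreadsheet cell (rows x columns).
```

Each axis of the product maps to one of the four spreadsheet dimensions, and the actions common to all variants (here, the base dataflow) need only be executed once.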
For example, if a scientist needs to compare visualizations generated by different algorithms, she can create a series of visualizations using the first algorithm, and then apply a bulk update that creates new dataflows by replacing the occurrences of the first algorithm with the second while keeping all the other modules and parameter settings intact. Together with the visualization spreadsheet provided by VisTrails, the parameter exploration mechanism allows users to effectively compare different visualizations. In addition, since VisTrails identifies and avoids redundant operations, dataflows generated from parameter explorations can be executed efficiently: the operations that are common to the set of affected dataflows need only be executed once.

Fig. 8. Computing the differences between two dataflows that derive visualizations of CT data of a lung containing pathological tissue. In VisTrails, users can select nodes (dataflows) in the vistrail tree to be compared. Here, the user selected the dataflows labeled Raycasted and Texture Mapped. Their corresponding visualizations are shown on the right and the differences between the two dataflows are shown at the bottom. Modules shown in blue are present only in Texture Mapped, orange modules are present only in Raycasted, dark-grey modules are present in both dataflows, and light-grey modules have different parameter values in the two versions, as shown on the bottom left for the vtkColorTransferFunction module.

D. Computing Pipeline Differences

The visualization discovery process requires many trial-and-error steps, and it is not uncommon for tens to hundreds of different pipelines to be generated in the course of a study. In such a scenario, it becomes important to understand what the differences between pipelines are, especially if multiple people are collaboratively exploring the data. Computing the differences between two pipelines by considering their underlying graph structure turns out to be impractical.
In fact, the related decision problem of subgraph isomorphism (or matching) is known to be NP-complete [13]. This problem becomes much easier using the action-based provenance model. Let d_1 and d_2 be two vistrail nodes corresponding to the pipelines P_1 and P_2, respectively:

d_1 = x_k ∘ x_{k-1} ∘ ... ∘ x_1 ∘ ∅
d_2 = y_i ∘ y_{i-1} ∘ ... ∘ y_1 ∘ ∅

Since all vistrail nodes are represented by paths in a rooted tree, they all share at least one common ancestor. Let c = x_m = y_j = lca(x_k, y_i) be the least common ancestor of x_k and y_i. The common part of d_1 and d_2 is the path from the root of the tree down to node c, i.e., {c, ..., ∅}. The difference computation is simple and efficient; it can be done in linear time. After obtaining the least common ancestor of the nodes, the difference d_2 - d_1 is represented by two lists of actions: {x_k, ..., x_{m+1}} and {y_i, ..., y_{j+1}}. Using these lists, we identify the modules that are unique to each of the two pipelines, and for modules that are present in both pipelines, we identify

their differences by examining setParameter actions. For display, we divide the nodes into four categories: shared nodes, shared nodes with parameter changes, nodes that belong only to P_1, and nodes that belong only to P_2. Edges are divided into shared edges, and edges that belong to P_1 or P_2, respectively. A remaining issue is how to handle nodes that have been deleted and re-inserted into a given dataflow. We handle this case by inspecting module names and function signatures. Figure 8 shows the visual difference interface provided by VisTrails. A visual difference is enacted by dragging one node in the history tree onto another, which opens a new window with a difference pipeline. The difference pipeline displays modules unique to the first node in orange, modules unique to the second node in blue, modules that are the same in dark grey, and modules that have different parameter values in light grey. The parameter changes are displayed in a subwindow when a light-grey module is selected. Using this interface, users can easily correlate differences between two visualizations with differences between their corresponding specifications.

VI. EVALUATION

Source code, binaries, and documentation for VisTrails are available at the project web page. In only a few months, the system has been downloaded over 400 times and is currently being used in several scientific settings. To evaluate the utility of VisTrails in these settings, we performed an informal expert review. Expert reviews have been shown to be a useful means of evaluation, and require fewer reviewers than standard user studies [30]. We collected comments and suggestions from four different users who have been using VisTrails in their research. A summary of the projects and comments is provided below.

Review 1: Ocean Simulations.
VisTrails is currently being used to jump-start 3D visualization capabilities for large-scale ocean circulation simulations in the CORIE project. According to our reviewer, oceanographers traditionally eschew 3D visualizations due to the difficulty of tuning and interpreting the increased number of parameters that result from the added dimensionality. The parameter exploration capabilities of VisTrails are being used to overcome this hurdle. This functionality is used not only to adjust characteristics of the visualizations, but also to select the data to be visualized. According to the reviewer, this technique allows VisTrails to "transcend its role as a visualization tool and become a query interface to our data repository." In addition, the provenance and execution mechanism in VisTrails is being used for a custom querying interface for accessing and manipulating simulation results. "With VisTrails, these query patterns can be developed, tweaked, stored, and reused flexibly: they can be called with the convenience of a function, but inspected visually as the algebraic expression that they are." Prior to VisTrails, the group used scripts to create visualizations and had no means of managing the creation and maintenance process.

Review 2: Mesh Quality Comparisons. Our second reviewer uses VisTrails to perform comparative analysis of isosurface extraction algorithms (see Figure 1). These comparisons typically involve images of the resulting mesh along with several histograms that display quality information about the mesh. At least three unique software packages are used just to create the different visualizations. This task requires frequent workflow modifications and substantial parameter tuning during the analysis process. Before VisTrails, the reviewer managed the data and software manually with scripts.
According to the reviewer, the fact that VisTrails has a supportive provenance system "allows me to explore different alternatives with more freedom, since I know that the different alternatives that I pursued before can be easily recovered." He also had positive feedback about the interface: "navigation is easy and querying or exploring different pipelines is done in a nice and intuitive way." As an area of improvement, the reviewer suggested that the system should include a more automatic way of clearing unwanted branches in the version tree. Overall, the reviewer was very positive and mentioned that he intends to use VisTrails in future projects.

Review 3: Experimental High Energy Physics. VisTrails is currently being used to develop a workflow management system for Experimental High Energy Physics (HEP) at a well-known research institution. In particular, the pipeline builder and provenance capturing mechanism are used extensively. Due to the flexibility of the system, it was easily adapted to include new modules, change the execution model, and add a custom scientific notebook for the results. The new execution model allows for asynchronous processing of the workflow with incremental updates of the results in order to give fast feedback to the physicist. The custom modules correspond to fine-grain programming structures (i.e., for loops and if statements). This fine granularity allows the use of VisTrails provenance tracking to record all the filtering changes typically done during an HEP analysis. The reviewer commented:

"I feel tracking of provenance is an essential part of scientific research and the HEP field does not have any tools to aid with provenance tracking. However, to me provenance tracking is about what you did, while what VisTrails does is even better, which is replay what you did and use that as a starting point for new changes. It is the latter ability which enticed me to use VisTrails." The reviewer also made some interesting suggestions for future versions of VisTrails: providing a mechanism for versioning external packages; allowing the user to assign importance to named versions in the history tree; and providing conventional undo/redo options during pipeline creation. The reviewer also noted that he evaluated other systems before VisTrails, but quickly abandoned them due to their lack of provenance tracking.

Review 4: Memory Studies. VisTrails is being used as a primary tool to study the effect of Repetitive Transcranial Magnetic Stimulation on the working memory centers of the brain. The study manages and analyzes MRI, EEG, and MEG data for many test subjects (see Figure 9). As the amount of data grows, it is increasingly important to track changes to the analysis methodologies as well as to the resulting visualizations. VisTrails provides a seamless method for doing this, as any change to the data analysis pipeline is automatically recorded. According to the reviewer: "provenance in general is not something that may be taken away from the scientific process. Without reproducible results, there is no science." As feedback, the reviewer suggested that the search and refine features of the history tree are useful, but that the system should allow the user to hide extraneous nodes in the history tree as their importance changes. The reviewer also suggested that the ability to create custom widgets (e.g., a transfer function editor) would be useful for some applications.
The reviewers provided important feedback and suggestions that will be used to improve the system. These suggestions are currently being incorporated into VisTrails and will be included in future releases of the software. In fact, some of the pruning functionality described in Section V was recently added to address reviewers' concerns about removing unwanted nodes in the history tree.

Fig. 9. An example of the history tree, pipeline, and spreadsheet used for visualizing measured data for a single patient in a memory study. This pipeline uses the VisTrails execution model to combine the VTK, ITK, and Matplotlib libraries.

VII. CONCLUSION

In this paper, we propose a new infrastructure for streamlining data exploration through visualization. The infrastructure leverages a new action-based provenance mechanism to provide useful features that allow effective exploration of visualizations over large parameter spaces. The provenance information captured by VisTrails can be used to augment existing scientific data repositories with the process that scientists go through to generate and analyze data. For example, a scientist can publish the vistrail used to generate the images in a paper. This has the obvious benefit of allowing scientific results to be reproduced. In addition, since substantial expert knowledge is involved in the exploratory process, capturing this information creates the opportunity to develop mining techniques that extract knowledge, in the form of exploratory patterns, which can be used to solve an array of problems. We are currently applying VisTrails to a number of different problem areas, from environmental sciences and medical imaging to computer-aided drug design and discovery. Our initial experiences have confirmed that the VisTrails data exploration infrastructure greatly simplifies the scientific discovery process, and that it indeed allows scientists to more effectively explore vast amounts of data.
This indicates that appropriate management of both the data and the processes involved in visualization has the potential to substantially increase the impact of visualization in scientific discovery. Although VisTrails was originally designed to support data exploration through visualization, ideas developed in the context of VisTrails have been successfully applied to scientific workflow systems in other domains. For example, algorithms developed for VisTrails have also been used in Kepler, a general scientific workflow system [24]; and Taverna, a workflow workbench developed as part of the UK's myGrid project, has recently added support for tracking the evolution path of workflows [37].


More information

3.4 Data-Centric workflow

3.4 Data-Centric workflow 3.4 Data-Centric workflow One of the most important activities in a S-DWH environment is represented by data integration of different and heterogeneous sources. The process of extract, transform, and load

More information

QDA Miner. Addendum v2.0

QDA Miner. Addendum v2.0 QDA Miner Addendum v2.0 QDA Miner is an easy-to-use qualitative analysis software for coding, annotating, retrieving and reviewing coded data and documents such as open-ended responses, customer comments,

More information

6.001 Notes: Section 6.1

6.001 Notes: Section 6.1 6.001 Notes: Section 6.1 Slide 6.1.1 When we first starting talking about Scheme expressions, you may recall we said that (almost) every Scheme expression had three components, a syntax (legal ways of

More information

Breeding Guide. Customer Services PHENOME-NETWORKS 4Ben Gurion Street, 74032, Nes-Ziona, Israel

Breeding Guide. Customer Services PHENOME-NETWORKS 4Ben Gurion Street, 74032, Nes-Ziona, Israel Breeding Guide Customer Services PHENOME-NETWORKS 4Ben Gurion Street, 74032, Nes-Ziona, Israel www.phenome-netwoks.com Contents PHENOME ONE - INTRODUCTION... 3 THE PHENOME ONE LAYOUT... 4 THE JOBS ICON...

More information

Education Services: Maximize the value of your data management investment with E-WorkBook Web E-WorkBook Web

Education Services: Maximize the value of your data management investment with E-WorkBook Web E-WorkBook Web Education Services: Maximize the value of your data management investment with E-WorkBook Web E-WorkBook Web (Super User 3 Days) IDBS Education Services provides a range of training options to help customers

More information

High Performance Computing Prof. Matthew Jacob Department of Computer Science and Automation Indian Institute of Science, Bangalore

High Performance Computing Prof. Matthew Jacob Department of Computer Science and Automation Indian Institute of Science, Bangalore High Performance Computing Prof. Matthew Jacob Department of Computer Science and Automation Indian Institute of Science, Bangalore Module No # 09 Lecture No # 40 This is lecture forty of the course on

More information

Developing SQL Databases

Developing SQL Databases Course 20762B: Developing SQL Databases Page 1 of 9 Developing SQL Databases Course 20762B: 4 days; Instructor-Led Introduction This four-day instructor-led course provides students with the knowledge

More information

Utilizing a Common Language as a Generative Software Reuse Tool

Utilizing a Common Language as a Generative Software Reuse Tool Utilizing a Common Language as a Generative Software Reuse Tool Chris Henry and Stanislaw Jarzabek Department of Computer Science School of Computing, National University of Singapore 3 Science Drive,

More information

Reference Requirements for Records and Documents Management

Reference Requirements for Records and Documents Management Reference Requirements for Records and Documents Management Ricardo Jorge Seno Martins ricardosenomartins@gmail.com Instituto Superior Técnico, Lisboa, Portugal May 2015 Abstract When information systems

More information

This Document is licensed to

This Document is licensed to GAMP 5 Page 291 End User Applications Including Spreadsheets 1 Introduction This appendix gives guidance on the use of end user applications such as spreadsheets or small databases in a GxP environment.

More information

Creating a Pivot Table

Creating a Pivot Table Contents Introduction... 1 Creating a Pivot Table... 1 A One-Dimensional Table... 2 A Two-Dimensional Table... 4 A Three-Dimensional Table... 5 Hiding and Showing Summary Values... 5 Adding New Data and

More information

Kepler/pPOD: Scientific Workflow and Provenance Support for Assembling the Tree of Life

Kepler/pPOD: Scientific Workflow and Provenance Support for Assembling the Tree of Life Kepler/pPOD: Scientific Workflow and Provenance Support for Assembling the Tree of Life Shawn Bowers 1, Timothy McPhillips 1, Sean Riddle 1, Manish Anand 2, Bertram Ludäscher 1,2 1 UC Davis Genome Center,

More information

Scientific Graphing in Excel 2013

Scientific Graphing in Excel 2013 Scientific Graphing in Excel 2013 When you start Excel, you will see the screen below. Various parts of the display are labelled in red, with arrows, to define the terms used in the remainder of this overview.

More information

AUTOMATIC GRAPHIC USER INTERFACE GENERATION FOR VTK

AUTOMATIC GRAPHIC USER INTERFACE GENERATION FOR VTK AUTOMATIC GRAPHIC USER INTERFACE GENERATION FOR VTK Wilfrid Lefer LIUPPA - Université de Pau B.P. 1155, 64013 Pau, France e-mail: wilfrid.lefer@univ-pau.fr ABSTRACT VTK (The Visualization Toolkit) has

More information

The Solution Your Legal Department Has Been Looking For

The Solution Your Legal Department Has Been Looking For Blog Post The Solution Your Legal Department Has Been Looking For Microsoft Outlook to be my DM Desktop In my experience every Legal Department has users who previously worked at a law firm, where they

More information

Computer Science Lab Exercise 1

Computer Science Lab Exercise 1 1 of 10 Computer Science 127 - Lab Exercise 1 Introduction to Excel User-Defined Functions (pdf) During this lab you will experiment with creating Excel user-defined functions (UDFs). Background We use

More information

Re-using Data Mining Workflows

Re-using Data Mining Workflows Re-using Data Mining Workflows Stefan Rüping, Dennis Wegener, and Philipp Bremer Fraunhofer IAIS, Schloss Birlinghoven, 53754 Sankt Augustin, Germany http://www.iais.fraunhofer.de Abstract. Setting up

More information

DesignMinders: A Design Knowledge Collaboration Approach

DesignMinders: A Design Knowledge Collaboration Approach DesignMinders: A Design Knowledge Collaboration Approach Gerald Bortis and André van der Hoek University of California, Irvine Department of Informatics Irvine, CA 92697-3440 {gbortis, andre}@ics.uci.edu

More information

An Oracle White Paper April Oracle Application Express 5.0 Overview

An Oracle White Paper April Oracle Application Express 5.0 Overview An Oracle White Paper April 2015 Oracle Application Express 5.0 Overview Disclaimer The following is intended to outline our general product direction. It is intended for information purposes only, and

More information

Automated Road Segment Creation Process

Automated Road Segment Creation Process David A. Noyce, PhD, PE Director and Chair Traffic Operations and Safety Laboratory Civil and Environmental Engineering Department Kelvin R. Santiago, MS, PE Assistant Researcher Traffic Operations and

More information

Last updated: 17 March Steven Chalmers

Last updated: 17 March Steven Chalmers Last updated: 17 March 2015 Steven Chalmers PORTFOLIO Introduction I excel at Interaction Design. I am versed in Tufte and Wroblewski rather than color palettes and font families. The samples of work I

More information

Relationships and Traceability in PTC Integrity Lifecycle Manager

Relationships and Traceability in PTC Integrity Lifecycle Manager Relationships and Traceability in PTC Integrity Lifecycle Manager Author: Scott Milton 1 P age Table of Contents 1. Abstract... 3 2. Introduction... 4 3. Workflows and Documents Relationship Fields...

More information

Introduction to IRQA 4

Introduction to IRQA 4 Introduction to IRQA 4 Main functionality and use Marcel Overeem 1/7/2011 Marcel Overeem is consultant at SpeedSoft BV and has written this document to provide a short overview of the main functionality

More information

Microsoft Developing SQL Databases

Microsoft Developing SQL Databases 1800 ULEARN (853 276) www.ddls.com.au Length 5 days Microsoft 20762 - Developing SQL Databases Price $4290.00 (inc GST) Version C Overview This five-day instructor-led course provides students with the

More information

Database Developers Forum APEX

Database Developers Forum APEX Database Developers Forum APEX 20.05.2014 Antonio Romero Marin, Aurelien Fernandes, Jose Rolland Lopez De Coca, Nikolay Tsvetkov, Zereyakob Makonnen, Zory Zaharieva BE-CO Contents Introduction to the Controls

More information

Dremel Digilab 3D Slicer Software

Dremel Digilab 3D Slicer Software Dremel Digilab 3D Slicer Software Dremel Digilab 3D Slicer prepares your model for 3D printing. For novices, it makes it easy to get great results. For experts, there are over 200 settings to adjust to

More information

Chapter Twelve. Systems Design and Development

Chapter Twelve. Systems Design and Development Chapter Twelve Systems Design and Development After reading this chapter, you should be able to: Describe the process of designing, programming, and debugging a computer program Explain why there are many

More information

Teiid Designer User Guide 7.5.0

Teiid Designer User Guide 7.5.0 Teiid Designer User Guide 1 7.5.0 1. Introduction... 1 1.1. What is Teiid Designer?... 1 1.2. Why Use Teiid Designer?... 2 1.3. Metadata Overview... 2 1.3.1. What is Metadata... 2 1.3.2. Editing Metadata

More information

Technology for Merchandise Planning and Control

Technology for Merchandise Planning and Control Technology for Merchandise Planning and Control Contents: Module Three: Formatting Worksheets Working with Charts UREFERENCE/PAGES Formatting Worksheets... Unit C Formatting Values... Excel 52 Excel 57

More information

Using ESML in a Semantic Web Approach for Improved Earth Science Data Usability

Using ESML in a Semantic Web Approach for Improved Earth Science Data Usability Using in a Semantic Web Approach for Improved Earth Science Data Usability Rahul Ramachandran, Helen Conover, Sunil Movva and Sara Graves Information Technology and Systems Center University of Alabama

More information

Oracle Warehouse Builder 10g Release 2 Integrating Packaged Applications Data

Oracle Warehouse Builder 10g Release 2 Integrating Packaged Applications Data Oracle Warehouse Builder 10g Release 2 Integrating Packaged Applications Data June 2006 Note: This document is for informational purposes. It is not a commitment to deliver any material, code, or functionality,

More information

CHAPTER 7 CONCLUSION AND FUTURE WORK

CHAPTER 7 CONCLUSION AND FUTURE WORK CHAPTER 7 CONCLUSION AND FUTURE WORK 7.1 Conclusion Data pre-processing is very important in data mining process. Certain data cleaning techniques usually are not applicable to all kinds of data. Deduplication

More information

Designed by Jason Wagner, Course Web Programmer, Office of e-learning NOTE ABOUT CELL REFERENCES IN THIS DOCUMENT... 1

Designed by Jason Wagner, Course Web Programmer, Office of e-learning NOTE ABOUT CELL REFERENCES IN THIS DOCUMENT... 1 Excel Essentials Designed by Jason Wagner, Course Web Programmer, Office of e-learning NOTE ABOUT CELL REFERENCES IN THIS DOCUMENT... 1 FREQUENTLY USED KEYBOARD SHORTCUTS... 1 FORMATTING CELLS WITH PRESET

More information

Databricks Delta: Bringing Unprecedented Reliability and Performance to Cloud Data Lakes

Databricks Delta: Bringing Unprecedented Reliability and Performance to Cloud Data Lakes Databricks Delta: Bringing Unprecedented Reliability and Performance to Cloud Data Lakes AN UNDER THE HOOD LOOK Databricks Delta, a component of the Databricks Unified Analytics Platform*, is a unified

More information

SILVACO. An Intuitive Front-End to Effective and Efficient Schematic Capture Design INSIDE. Introduction. Concepts of Scholar Schematic Capture

SILVACO. An Intuitive Front-End to Effective and Efficient Schematic Capture Design INSIDE. Introduction. Concepts of Scholar Schematic Capture TCAD Driven CAD A Journal for CAD/CAE Engineers Introduction In our previous publication ("Scholar: An Enhanced Multi-Platform Schematic Capture", Simulation Standard, Vol.10, Number 9, September 1999)

More information

20762B: DEVELOPING SQL DATABASES

20762B: DEVELOPING SQL DATABASES ABOUT THIS COURSE This five day instructor-led course provides students with the knowledge and skills to develop a Microsoft SQL Server 2016 database. The course focuses on teaching individuals how to

More information

Joint Entity Resolution

Joint Entity Resolution Joint Entity Resolution Steven Euijong Whang, Hector Garcia-Molina Computer Science Department, Stanford University 353 Serra Mall, Stanford, CA 94305, USA {swhang, hector}@cs.stanford.edu No Institute

More information

CA Productivity Accelerator 12.1 and Later

CA Productivity Accelerator 12.1 and Later CA Productivity Accelerator 12.1 and Later Localize Content Localize Content Once you have created content in one language, you might want to translate it into one or more different languages. The Developer

More information

Heuristic Evaluation of Covalence

Heuristic Evaluation of Covalence Heuristic Evaluation of Covalence Evaluator #A: Selina Her Evaluator #B: Ben-han Sung Evaluator #C: Giordano Jacuzzi 1. Problem Covalence is a concept-mapping tool that links images, text, and ideas to

More information

Using JBI for Service-Oriented Integration (SOI)

Using JBI for Service-Oriented Integration (SOI) Using JBI for -Oriented Integration (SOI) Ron Ten-Hove, Sun Microsystems January 27, 2006 2006, Sun Microsystems Inc. Introduction How do you use a service-oriented architecture (SOA)? This is an important

More information

6.001 Notes: Section 15.1

6.001 Notes: Section 15.1 6.001 Notes: Section 15.1 Slide 15.1.1 Our goal over the next few lectures is to build an interpreter, which in a very basic sense is the ultimate in programming, since doing so will allow us to define

More information

Scientific Graphing in Excel 2007

Scientific Graphing in Excel 2007 Scientific Graphing in Excel 2007 When you start Excel, you will see the screen below. Various parts of the display are labelled in red, with arrows, to define the terms used in the remainder of this overview.

More information

ZENworks Reporting System Reference. January 2017

ZENworks Reporting System Reference. January 2017 ZENworks Reporting System Reference January 2017 Legal Notices For information about legal notices, trademarks, disclaimers, warranties, export and other use restrictions, U.S. Government rights, patent

More information

B.H.GARDI COLLEGE OF MASTER OF COMPUTER APPLICATION. Ch. 1 :- Introduction Database Management System - 1

B.H.GARDI COLLEGE OF MASTER OF COMPUTER APPLICATION. Ch. 1 :- Introduction Database Management System - 1 Basic Concepts :- 1. What is Data? Data is a collection of facts from which conclusion may be drawn. In computer science, data is anything in a form suitable for use with a computer. Data is often distinguished

More information

Multicore Computing and Scientific Discovery

Multicore Computing and Scientific Discovery scientific infrastructure Multicore Computing and Scientific Discovery James Larus Dennis Gannon Microsoft Research In the past half century, parallel computers, parallel computation, and scientific research

More information

Comprehensive Guide to Evaluating Event Stream Processing Engines

Comprehensive Guide to Evaluating Event Stream Processing Engines Comprehensive Guide to Evaluating Event Stream Processing Engines i Copyright 2006 Coral8, Inc. All rights reserved worldwide. Worldwide Headquarters: Coral8, Inc. 82 Pioneer Way, Suite 106 Mountain View,

More information

PHP Composer 9 Benefits of Using a Binary Repository Manager

PHP Composer 9 Benefits of Using a Binary Repository Manager PHP Composer 9 Benefits of Using a Binary Repository Manager White Paper Copyright 2017 JFrog Ltd. March 2017 www.jfrog.com Executive Summary PHP development has become one of the most popular platforms

More information

A Guide to CMS Functions

A Guide to CMS Functions 2017-02-13 Orckestra, Europe Nygårdsvej 16 DK-2100 Copenhagen Phone +45 3915 7600 www.orckestra.com Contents 1 INTRODUCTION... 3 1.1 Who Should Read This Guide 3 1.2 What You Will Learn 3 2 WHAT IS A CMS

More information

Eloqua Insight Intro Analyzer User Guide

Eloqua Insight Intro Analyzer User Guide Eloqua Insight Intro Analyzer User Guide Table of Contents About the Course Materials... 4 Introduction to Eloqua Insight for Analyzer Users... 13 Introduction to Eloqua Insight... 13 Eloqua Insight Home

More information

Visualizing the Life and Anatomy of Dark Matter

Visualizing the Life and Anatomy of Dark Matter Visualizing the Life and Anatomy of Dark Matter Subhashis Hazarika Tzu-Hsuan Wei Rajaditya Mukherjee Alexandru Barbur ABSTRACT In this paper we provide a visualization based answer to understanding the

More information

Using Provenance to Support Real-Time Collaborative Design of Workflows

Using Provenance to Support Real-Time Collaborative Design of Workflows Using Provenance to Support Real-Time Collaborative Design of Workflows Tommy Ellkvist, David Koop 2, Erik W. Anderson 2, Juliana Freire 2, and Cláudio Silva 2 Linköpings universitet, Linköping, Sweden

More information

Course 40045A: Microsoft SQL Server for Oracle DBAs

Course 40045A: Microsoft SQL Server for Oracle DBAs Skip to main content Course 40045A: Microsoft SQL Server for Oracle DBAs - Course details Course Outline Module 1: Database and Instance This module provides an understanding of the two major components

More information

MutanT: A Modular and Generic Tool for Multi-Sensor Data Processing

MutanT: A Modular and Generic Tool for Multi-Sensor Data Processing 12th International Conference on Information Fusion Seattle, WA, USA, July 6-9, 2009 MutanT: A Modular and Generic Tool for Multi-Sensor Data Processing Simon Hawe, Ulrich Kirchmaier, Klaus Diepold Lehrstuhl

More information

3Lesson 3: Web Project Management Fundamentals Objectives

3Lesson 3: Web Project Management Fundamentals Objectives 3Lesson 3: Web Project Management Fundamentals Objectives By the end of this lesson, you will be able to: 1.1.11: Determine site project implementation factors (includes stakeholder input, time frame,

More information

1. WELDMANAGEMENT PRODUCT

1. WELDMANAGEMENT PRODUCT Table of Contents WeldManagement Product.................................. 3 Workflow Overview........................................ 4 Weld Types.............................................. 5 Weld

More information

INTRO TO EXCEL 2007 TOPICS COVERED. Department of Technology Enhanced Learning Information Technology Systems Division. What s New in Excel

INTRO TO EXCEL 2007 TOPICS COVERED. Department of Technology Enhanced Learning Information Technology Systems Division. What s New in Excel Information Technology Systems Division What s New in Excel 2007... 2 Creating Workbooks... 6 Modifying Workbooks... 7 Entering and Revising Data... 10 Formatting Cells... 11 TOPICS COVERED Formulas...

More information

SEARCH AND INFERENCE WITH DIAGRAMS

SEARCH AND INFERENCE WITH DIAGRAMS SEARCH AND INFERENCE WITH DIAGRAMS Michael Wollowski Rose-Hulman Institute of Technology 5500 Wabash Ave., Terre Haute, IN 47803 USA wollowski@rose-hulman.edu ABSTRACT We developed a process for presenting

More information

MICROSOFT BUSINESS INTELLIGENCE

MICROSOFT BUSINESS INTELLIGENCE SSIS MICROSOFT BUSINESS INTELLIGENCE 1) Introduction to Integration Services Defining sql server integration services Exploring the need for migrating diverse Data the role of business intelligence (bi)

More information

Warewolf User Guide 1: Introduction and Basic Concepts

Warewolf User Guide 1: Introduction and Basic Concepts Warewolf User Guide 1: Introduction and Basic Concepts Contents: An Introduction to Warewolf Preparation for the Course Welcome to Warewolf Studio Create your first Microservice Exercise 1 Using the Explorer

More information

One-PROC-Away: The Essence of an Analysis Database Russell W. Helms, Ph.D. Rho, Inc.

One-PROC-Away: The Essence of an Analysis Database Russell W. Helms, Ph.D. Rho, Inc. One-PROC-Away: The Essence of an Analysis Database Russell W. Helms, Ph.D. Rho, Inc. Chapel Hill, NC RHelms@RhoWorld.com www.rhoworld.com Presented to ASA/JSM: San Francisco, August 2003 One-PROC-Away

More information

Lo-Fidelity Prototype Report

Lo-Fidelity Prototype Report Lo-Fidelity Prototype Report Introduction A room scheduling system, at the core, is very simple. However, features and expansions that make it more appealing to users greatly increase the possibility for

More information