SUBMITTED TO IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS

Using Provenance to Streamline Data Exploration through Visualization

Steven P. Callahan, Juliana Freire, Member, IEEE, Emanuele Santos, Carlos E. Scheidegger, Cláudio T. Silva, Member, IEEE, and Huy T. Vo

Abstract

Scientists are faced with increasingly large volumes of data. To analyze and validate various hypotheses, they need to create insightful visual representations of both observed data and simulated processes. Often, insight comes from comparing multiple visualizations. But data exploration through visualization requires scientists to assemble complex workflows, and today this process involves many error-prone and time-consuming tasks. We show that a new action-based model for capturing and maintaining detailed provenance of the visualization process can be used to streamline data exploration and reduce the time to insight. We describe an implementation of this infrastructure in the VisTrails system, where it enables the flexible re-use of pipelines, a scalable mechanism for creating a large number of visualizations, and the comparison of pipelines and their results. This provenance not only ensures reproducibility, but also allows scientists to easily navigate through the space of pipelines and parameter settings used in an exploration task.

Index Terms: Visualization, visualization systems, scientific exploration, provenance, history

I. INTRODUCTION

Visualization is a key enabling technology in the endeavor to comprehend the vast amounts of data being produced and acquired. To successfully analyze and validate various hypotheses, it is necessary to pose several queries, correlate disparate data, and create insightful visualizations of both the simulated processes and observed phenomena. However, data exploration through visualization often requires scientists to go through several steps.
Before users can view and analyze results, they need to assemble and execute complex pipelines by selecting data sets, specifying a series of operations to be performed on the data, and creating an appropriate visual representation [18], [35]. Often, insight comes from comparing the results of multiple visualizations created during the exploration process. Comparative visualization is essential when applying a given visualization process to multiple datasets generated in different simulations, when varying the values of certain visualization parameters, or when applying different variations of a given process (e.g., variations that use different visualization algorithms) to a dataset. Unfortunately, today this exploratory process contains many manual, error-prone, and time-consuming tasks. As a result, users spend much of their time managing the data and processes rather than on scientific investigation and discovery. As an example, consider the CORIE project that was established to study and model the spatial and temporal variability of the Lower Columbia River. CORIE scientists make heavy use of visualization to understand large volumes of data gathered from sensors spread along the river as well as derived from their simulation models. But the generation and maintenance of visualizations is a major bottleneck for the CORIE project. For example, to generate a new visualization that calibrates a simulation of the tidal cycle, a coordinated effort that includes several project members is needed.

S. Callahan, E. Santos, C. Scheidegger, C. Silva, and H. Vo are with the Scientific Computing and Imaging Institute, University of Utah. {stevec,emanuele,cscheid,csilva,hvo}@sci.utah.edu
J. Freire is with the School of Computing, University of Utah. juliana@cs.utah.edu
A modeling expert must run the simulation with the right combination of parameters; the sensor data must be collected and prepared by another staff member; and finally, a visualization expert constructs the visualizations. At the end of this process, captions and legends are often the only metadata available. This makes it hard, and sometimes impossible, to reproduce the visualization. For the thousands of visualizations that are published daily on the CORIE Web site, scripts are created that automate their generation. But finding and running the scripts are tasks that can only be performed by their creators. Even for their creators, it is hard to keep track of the correct versions of the scripts and the data they manipulate. Such ad-hoc processes are not suitable for the exploratory nature of data visualization, and they can hinder scientists' ability to understand the data and gain insights. In this paper, we propose a novel infrastructure whose goal is to simplify and streamline the process of

data exploration through visualization.

Fig. 1. An example of exploratory visualization for comparing isosurface extraction techniques using VisTrails. Complete provenance of the creation and exploration process is displayed in a history tree on the left. One pipeline that combines five different software libraries is shown in the middle. A parameter exploration of this pipeline for three different datasets and isosurface parameters is shown in the Visualization Spreadsheet on the right.

Visualization Systems: The State of the Art. Visualization systems such as VTK [27] and SCIRun [25] allow the interactive creation and manipulation of complex visualizations. These systems are based on the notion of dataflows [21], and they provide visual interfaces to produce visualizations by assembling pipelines out of modules (or functions) connected in a network (in this paper, we use the terms pipeline, dataflow, and workflow interchangeably). Although these systems allow the creation of complex visualizations, they have important limitations that hamper their ability to support the data exploration process at a large scale: they lack adequate support for the collaborative creation and exploration of a large number of visualizations. Because these systems do not distinguish between the definition of a dataflow and its instances, to execute a given dataflow with different parameters (e.g., different input files), users need to manually set these parameters through a GUI. Clearly, this process does not scale to more than a few visualizations. Additionally, modifications to parameters or to the definition of a dataflow are destructive: no change history is maintained. This places the burden on the user to first construct the visualization and then to remember the input data sets, parameter values, and the exact dataflow configuration that led to a particular image. Finally, they do not exploit optimization opportunities during pipeline execution.
For example, SCIRun not only materializes all intermediate results, but it may also repeatedly recompute the same results if so defined in the visualization pipeline.

Action-Based Provenance and Data Exploration. In this paper we describe a new action-based provenance model we designed for VisTrails. (Provenance refers to all information needed to reproduce a certain piece of data or assertion; it is also referred to as audit trail, lineage, or pedigree.) Unlike visualization systems [19], [25] and scientific workflow systems [1], [24], [29], [32], VisTrails captures detailed provenance of the exploratory process. The action-based model uniformly captures both changes to parameter values and changes to pipeline definitions by unobtrusively tracking all changes users make to dataflows in an exploration task. We refer to this detailed provenance of the pipeline evolution as a visualization trail, or a vistrail. The stored provenance ensures reproducibility of the visualizations, and it also allows scientists to easily navigate through the space of dataflows created for a given exploration task. The VisTrails interface gives them the ability to query, interact with, and understand the history of the visualization process. In particular, they can return to previous versions of a pipeline and change the specification or parameters to generate a new visualization without losing previous changes (see Figure 1). Another important feature of the action-based provenance model is that it enables a series of operations that greatly simplify the exploration process and have the potential

to reduce the time to insight. In particular, it allows the flexible re-use of pipelines, and it provides a scalable mechanism for creating and comparing a large number of visualizations as well as their corresponding pipelines.

Outline and Contributions. This paper provides the first detailed description of the VisTrails provenance and data exploration infrastructure. In Section II, we review related work. A brief overview of VisTrails is given in Section III. In Section IV, we present the action-based provenance mechanism, which captures both data and dataflow provenance. In Section V, we describe general mechanisms for enabling pipeline re-use, computing pipeline differences, and automatically generating a large number of visualizations through parameter-space explorations. To evaluate the effectiveness of VisTrails in different settings and application domains, we performed an expert review, which is discussed in Section VI.

II. RELATED WORK

The first implementation of VisTrails [2] only tracked the provenance of visualizations: for a given visualization, it stored the steps and parameters that led to the visualization. In more recent work [6], we introduced the notion of action-based provenance and showed how it captures the evolution of dataflows. A brief introduction to VisTrails as a tool for exploratory visualization is given in other recent papers [7], [11]. In this paper, we expand on these introductions by giving a detailed description of the VisTrails data exploration infrastructure, and by showing how action-based provenance can be used to implement features that greatly simplify and streamline the scientific discovery process. We also describe new user interfaces we designed that allow users to interact with and leverage the provenance information.

Visualization Systems.
Several systems are available for creating and executing visualization pipelines, such as AVS [33], OpenDX [15], SCIRun [25], ParaView [19], and VTK [27]. Most of these systems use a dataflow model, where a visualization is produced by assembling visualization pipelines out of basic modules. They typically provide easy-to-use, visual interfaces to create the pipelines. However, as discussed above, these systems lack the infrastructure to properly manage a large number of pipelines, and they often apply naïve execution models that do not exploit optimization opportunities. The solutions we propose in VisTrails can be easily integrated with these systems.

Comparative Visualization. The use of spreadsheets for displaying multiple images has been proposed to enable comparative visualization. Levoy's Spreadsheet for Images [22] is an alternative to the flow-chart-style layout employed by many earlier systems that use the dataflow model. It devotes its screen real estate to viewing data by using a tabular layout and hiding the specification of operations in interactively programmable cell formulas. The 2D nature of the spreadsheet encourages the application of multiple operations to multiple data sets through row- or column-based operations. Chi et al. [9] apply the spreadsheet paradigm to information visualization in their Spreadsheet for Information Visualization. Linking between cells is done at multiple levels, ranging from object interactions at the geometric level to arithmetic operations at the pixel level. The difficulty with both of these spreadsheets is that they fail to capture the provenance of the exploration process, since the spreadsheet only represents the latest state of the system. In VisTrails, we support concurrent exploration of multiple visualizations with the use of a spreadsheet. The interface is similar to the one proposed by Jankun-Kelly and Ma [16], and it provides a natural way to explore a multi-dimensional parameter space (Section V).
Data Provenance and Workflow Evolution. The issue of provenance in the context of scientific workflows has received substantial attention recently. Most works, however, focus on data provenance, i.e., maintaining information about how a given data product was generated [28]. This information has many uses, from purely informational to enabling the regeneration of the data, possibly with different parameters. However, while solving a particular problem, scientists often create several variations of a workflow in a trial-and-error process. These workflows may differ both in the parameter values used and in their specifications. If only the provenance of individual results is maintained, useful information about the relationships among the workflows is lost. In addition, since much expert knowledge is involved in the exploratory process, the change history of both data (e.g., input parameters) and processes contains important knowledge that can potentially be extracted and re-used to solve an array of problems. To the best of our knowledge, VisTrails is the first system to provide support for tracking workflow evolution. In their pioneering work on the GRASPARC system, Brodlie et al. [4] proposed the use of a history mechanism that allowed scientists to steer an ongoing simulation by backtracking a few steps, changing parameters, and resuming execution. However, their focus was on steering time-dependent simulations, not on data exploration. Derthick and Roth [10] introduced a tree-like structure for branching history for pre-defined

operations that consist of queries over a database. This work did not provide a formal model for actions, nor did it address the notion of a general workflow as in VisTrails. Kreuseler et al. [20] proposed a history mechanism for exploratory data mining. They use a tree structure, similar to a vistrail, to represent the change history, and describe how undo and redo operations can be calculated in this tree structure. They describe a theoretical framework that attempts to capture the complete state of a software system. In contrast, VisTrails only tracks the evolution of the workflows, which allows for the much simpler action-based provenance mechanism described in Section IV.

The issue of provenance has also been addressed in systems that provide data management infrastructure for scientific applications [12], [23]. Frew and Bose [12] described a mechanism that logs information about workflow executions in the Earth System Science Workbench (ESSW). The logged information allows them, for example, to infer linkages between science objects. Such information can also be obtained from a vistrail. It is not clear, however, whether it is possible to infer from the ESSW logs how a particular workflow evolved over time. GridDB [23] is a data-centric overlay for scientific grid data analysis. By modeling the input parameters and outputs of workflow modules as relations (the relational cover of a workflow), it provides a declarative interface for steering grid computations and for exploring parameter sets. However, GridDB maintains only data provenance: a given visualization can be traced back to entities of the relational cover (the subset of the data represented with relations). Similarly, other systems have recently been introduced that track the parameters changed during the user's exploration process. Groth and Streefkerk [14] capture viewing manipulations of a visualization and allow annotations on these explorations. Tory et al. [31] provide a parallel-coordinates-style interface for exploring the parameter space; they track the parameters changed and display them as a history bar of the resulting images. Finally, Jankun-Kelly and Ma [17] provide a foundation for tracking visualization parameters with the goal of encapsulating the exploration process for visualization. However, unlike VisTrails, these systems do not capture workflow evolution.

III. VISTRAILS: OVERVIEW

Our motivation for developing VisTrails came from the realization that visualization systems lack adequate support for large-scale data exploration. VisTrails is a visualization management system that manages both the process and the metadata associated with visualizations. With VisTrails, we aim to give scientists a dramatically improved and simplified process to analyze and visualize large ensembles of simulations and observed phenomena.

Fig. 2. VisTrails Architecture.

A. System Architecture

The high-level architecture of the system is shown in Figure 2. Users create and edit dataflows using the Vistrail Builder user interface. The dataflow specifications are saved in the Vistrail Repository. Users may also interact with saved dataflows by invoking them through the Vistrail Server (e.g., through a Web-based interface) or by importing them into the Visualization Spreadsheet. Each cell in the spreadsheet represents a view that corresponds to a dataflow instance; users can modify the parameters of a dataflow as well as synchronize parameters across different cells. Dataflow execution is controlled by the Vistrail Cache Manager, which keeps track of invoked operations and their respective parameters.
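This caching behavior can be sketched as follows. The sketch below is a minimal illustration, not the actual VisTrails implementation: the `Module`, `signature`, and `CacheManager` structures are invented, and the key idea shown is only that a module's cache key covers its operation, its parameters, and its entire upstream subpipeline, so that structurally identical subpipelines are executed once.

```python
import hashlib

class Module:
    """A dataflow module: an operation, its parameters, and upstream inputs."""
    def __init__(self, op, params, upstream=()):
        self.op, self.params, self.upstream = op, dict(params), list(upstream)

    def signature(self):
        """Hash the operation, parameters, and upstream subpipeline so that
        structurally identical subpipelines share one cache entry."""
        h = hashlib.sha256()
        h.update(self.op.encode())
        for k in sorted(self.params):
            h.update(f"{k}={self.params[k]}".encode())
        for u in self.upstream:
            h.update(u.signature().encode())
        return h.hexdigest()

class CacheManager:
    """Only signatures not seen before are sent to the execution engine."""
    def __init__(self, engine):
        self.engine, self.cache = engine, {}

    def run(self, module):
        sig = module.signature()
        if sig not in self.cache:
            inputs = [self.run(u) for u in module.upstream]
            self.cache[sig] = self.engine(module.op, module.params, inputs)
        return self.cache[sig]
```

With this scheme, running two pipelines that share a reader module executes the reader only once; the second pipeline's request for it is satisfied from the cache.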
Only new combinations of operations and parameters are requested from the Vistrail Dataflow Execution Engine, which executes the operations by invoking the appropriate modules from the underlying libraries (e.g., VTK, ITK, matplotlib). The execution engine also interacts with the Optimizer module, which analyzes and optimizes the dataflow specifications. A log of dataflow executions is kept in the Vistrail Log. The different components of the system are briefly described below. A more detailed description of an earlier version of our system is given in previous work [2].

Dataflow Specifications. A key feature that distinguishes VisTrails from previous visualization systems is that it separates the notion of a dataflow specification from its instances. A dataflow instance consists of a sequence of operations used to generate a visualization. This information serves both as a log of the steps followed to generate a visualization (a record of the visualization provenance) and as a recipe to automatically regenerate the visualization at a later time. The steps can be replayed exactly as they were first executed, and they can also be used as templates, since they can be parameterized. For example, the visualization spreadsheet in Figure 3(b)

illustrates a multi-view visualization of a single dataflow specification where the file name parameter is varied in several of the spreadsheet cells.

Fig. 3. The Vistrail Builder (a) and Visualization Spreadsheet (b) showing the dataflows that derive visualizations of fresh-water discharge from the Columbia River for different time steps retrieved from the Web.

Operations in a vistrail dataflow include visualization operations (e.g., VTK calls); application-specific steps (e.g., invoking a simulation script); and general file manipulation modules (e.g., transferring files between servers). To handle the variability in the structure of different kinds of operations, and to easily support the addition of new operations, we defined a flexible XML schema [5] to represent the dataflows. The schema captures all information required to re-execute a given dataflow: it stores information about the individual modules in the dataflow (e.g., the function executed by the module, input and output parameters) and their connections, i.e., how the outputs of a given module are connected to the input ports of another module. The XML representation for vistrail dataflows allows the reuse of standard XML tools and technologies. An important benefit of using an open, self-describing specification is the ability to share (and publish) dataflows. This allows a scientist to publish an image along with its associated dataflow so that others can easily reproduce the results. Another benefit of using XML is that the dataflow specification can be queried using standard XML query languages such as XPath [36] and XQuery [3]. For example, an XQuery query could be used in the context of the CORIE project to locate a dataflow that provides a 3D visualization of the salinity at the Columbia River estuary (as in Figure 3) in a database of published dataflows.
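As a rough sketch of this kind of query, the snippet below searches a hypothetical XML serialization of a dataflow with an XPath-style expression. The element and attribute names (`dataflow`, `module`, `function`, `connection`) are invented for illustration and do not reflect the actual VisTrails schema; Python's standard `xml.etree.ElementTree` supports only a small XPath subset, whereas a real deployment could use full XPath or XQuery.

```python
import xml.etree.ElementTree as ET

# A hypothetical dataflow serialization; element and attribute names
# are illustrative, not the actual VisTrails schema.
doc = """
<dataflow id="salinity-3d">
  <module id="1" function="FileReader"/>
  <module id="2" function="vtkContourFilter"/>
  <connection source="1" sourcePort="output" target="2" targetPort="input"/>
</dataflow>
"""

root = ET.fromstring(doc)
# XPath-subset query: find the modules that execute a particular function.
hits = root.findall(".//module[@function='vtkContourFilter']")
print([m.get("id") for m in hits])  # ['2']
```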
Once the dataflow is found, the scientist could then apply the same dataflow to more current simulation results, or modify the dataflow to test an alternative hypothesis.

Caching, Analysis and Optimization. Having a high-level specification allows the system to analyze and optimize dataflows. Executing a dataflow can take a long time, especially if large data sets and complex visualization operations are used. It is thus important to be able to analyze the specification and identify optimization opportunities. Possible optimizations include, for example, factoring out common subexpressions that produce the same value; removing no-ops; identifying steps that can be executed in parallel; and identifying intermediate results that should be cached to minimize execution time. Although most of these optimization techniques are widely used in other areas, they have yet to be applied in dataflow-based visualization systems. In our current VisTrails prototype, we have implemented memoization (caching). VisTrails leverages the dataflow specification to identify and avoid redundant operations. The algorithms and implementation of the Vistrail Cache Manager (VCM) are described in previous work [2]. Caching is especially useful while exploring multiple visualizations: when variations of the same dataflow need to be executed, substantial speedups can be obtained by caching the results of overlapping subsequences of the dataflows.

Running a Dataflow. The Vistrail Dataflow Execution Engine receives as input a dataflow instance and executes it by invoking the appropriate libraries. Currently,

VisTrails supports VTK, ITK, HTTP, SciPy, ImageMagick, Web services, and Matplotlib, among others. New packages are easily ported and are frequently added. VisTrails provides a plug-in mechanism that greatly simplifies the addition of new dataflow modules. To add a new method, a Python [34] wrapper needs to be created that defines the input and output parameters, as well as specifies how to invoke the method. Details on the plug-in wrapping process are available on the VisTrails project Web site.

Fig. 4. A snapshot of the VisTrails history management interface. Each node in the vistrail history tree represents a dataflow version. An edge between a parent and a child node represents the set of actions applied to the parent to obtain the dataflow for the child node.

The Execution Engine is unaware of caching. To accommodate caching in the execution engine, we use a class that behaves essentially like a proxy: to the rest of the dataflow it looks like a filter, but instead of performing computations, it simply looks up the result in the cache. The VCM is responsible for replacing a subnetwork that has been previously executed with an appropriate instance of the caching class. After the (partial) dataflow is executed, its outputs are stored in new cache entries.

Information pertinent to the execution of a particular dataflow instance is kept in the Vistrail Log (see Figure 2). There are many benefits to keeping this information. One is the ability to debug the application; e.g., it is possible to check the results of a dataflow using simulation data against sensor data. Another important benefit is that it reduces the cost of failures: if a visualization process fails, it can be restarted from the failure point. The latter is especially useful for long-running processes, as it may be very expensive and time-consuming to execute the whole process from scratch.
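The plug-in wrapper described above, which declares a module's input and output ports and says how to invoke the wrapped method, might look roughly as follows. The `Module` base class, the port declarations, and the `GaussianSmooth` example are all hypothetical; the actual VisTrails plug-in API is documented on the project Web site and differs from this sketch.

```python
# A minimal sketch of a plug-in wrapper: declare input/output ports and
# specify how to invoke the wrapped method. The Module/port API below is
# hypothetical, not the actual VisTrails one.

class Module:
    input_ports = {}
    output_ports = {}

    def compute(self, **inputs):
        raise NotImplementedError

class GaussianSmooth(Module):
    """Wraps a hypothetical smoothing routine from an imaging library."""
    input_ports = {"image": list, "sigma": float}
    output_ports = {"smoothed": list}

    def compute(self, image, sigma):
        # Stand-in for the real library call (e.g., an ITK smoothing filter);
        # here we just scale the samples so the sketch is runnable.
        kernel_weight = 1.0 / (1.0 + sigma)
        return {"smoothed": [v * kernel_weight for v in image]}
```

Because the port declarations are plain data, the system can introspect them to draw the module in the builder and to type-check connections, which is in the spirit of the data structure shared by the plug-in wrappers and the Vistrail Builder.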
Logging all the information associated with all dataflows may not be feasible, so VisTrails provides an interface that lets users select which information, and how much of it, should be saved.

Creating and Interacting with Vistrails. The Vistrail Builder (VB) allows users to create and edit dataflows (see Figure 3(a)). It reads and writes dataflows in the same XML format as the other components of the system, and it shares the familiar nodes-and-connections paradigm with other dataflow systems. To automatically generate the visual representation of the modules, it reads the same data structure used by the plug-in wrappers. Like the Execution Engine, the VB requires no change to support additional modules.

The VisTrails Visualization Spreadsheet allows users to compare the results of multiple dataflows. It consists of a set of separate visualization windows arranged in a tabular view. This layout makes efficient use of screen space, and the row/column groupings can conceptually help the user explore the visualization parameter space [8], [9] (see Section V-C). The cells in a spreadsheet may execute different dataflows, and they may also use different parameters for the same dataflow specification (see Figure 3). To ensure efficient execution, all cells share the same cache. Note that cells in a spreadsheet can be synchronized in many different ways. For example, in Figure 7, cells are synchronized with respect to the camera viewpoint: if one cell is rotated, the same rotation is applied to the other synchronized cells. This figure also illustrates the usefulness of the spreadsheet for exploring the parameter space of an application. In this case, it allows different visualizations of MRI data of a head to be compared: different Alpha parameters of an image filter are used along the horizontal axis, and different Beta parameters along the vertical axis.
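The Alpha-by-Beta layout just described amounts to taking a cross product of parameter values and binding each combination to one spreadsheet cell. A minimal sketch, with invented parameter names and values:

```python
from itertools import product

alphas = [0.2, 0.5, 0.8]  # varied along the horizontal axis (illustrative values)
betas = [1, 2]            # varied along the vertical axis (illustrative values)

# Each spreadsheet cell is the same dataflow specification with one
# (Alpha, Beta) combination bound to the image filter's parameters.
cells = [
    {"row": r, "col": c, "params": {"Alpha": a, "Beta": b}}
    for (r, b), (c, a) in product(enumerate(betas), enumerate(alphas))
]
print(len(cells))  # 6 cells: a 2x3 grid
```

Generating the cell configurations this way scales to any number of parameter dimensions: an n-dimensional slice of the parameter space is just an n-way cross product.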

IV. ACTION-BASED PROVENANCE

VisTrails uses an action-based model to capture provenance. As the scientist makes modifications to a particular dataflow, the provenance mechanism records those changes. Instead of storing a set of related dataflows, we store the operations, or actions, that are applied to the dataflows (e.g., the addition of a module, the modification of a parameter, etc.). This representation is both simple and compact: it uses substantially less space than the alternative of storing multiple versions of a dataflow. In addition, it enables the construction of an intuitive interface that allows scientists to both understand and interact with the history of the dataflow through these changes, as illustrated in Figure 4. A tree-based view allows a scientist to return to a previous version in an intuitive way, to undo bad changes, to compare different dataflows, and to be reminded of the actions that led to a particular result. This, combined with a caching strategy that eliminates redundant computations, allows the scientist to efficiently explore a large number of related visualizations. VisTrails is the first system to provide a mechanism that uniformly captures provenance information for both the data and the processes that derive the data. Along with the benefits of previously proposed provenance mechanisms, such as reproducibility and documentation, the action-based model has a number of additional benefits in the context of data exploration. As we discuss in Section V, it enables operations that streamline the data exploration process.

Vistrail: An Evolving Dataflow. A vistrail VT is a rooted tree in which each node corresponds to a version of a dataflow, and each edge between nodes d_p and d_c, where d_p is the parent of d_c, corresponds to the action applied to d_p that generated d_c. This is similar to the versioning mechanism used in Darcs [26].
The vistrail in Figure 4 shows a set of changes to a dataflow that was used to generate some CORIE visualizations similar to those shown in Figure 3; in this case, a dataflow to visualize the salinity in a small section of the estuary. This dataflow (tagged "With Text") was used to create four different dataflows that represent different time steps of the data.

More formally, let DF be the domain of all possible dataflow instances, where ∅ ∈ DF is a special empty dataflow. Also, let x : DF → DF be a function that transforms a dataflow instance into another, and let D be the set of all such functions. A vistrail node corresponding to a dataflow d is constructed by composing a sequence of actions, where each x_i ∈ D:

    d = x_n(x_{n-1}(... (x_1(∅)) ...))

In what follows, we use the following notation: we represent dataflows as d_p, and if a dataflow d_c is created by applying a sequence of actions to d_p, we say that d_p < d_c (i.e., the vistrail node d_c is a descendant of d_p). In addition, ∅ ∈ VT, and for every d_n ∈ VT, if d_n ≠ ∅, then ∅ < d_n.

    type Vistrail = [ Action*, annotation? ]
    type Action   = [ tag?, annotation?,
                      (addmodule | deletemodule | replacemodule |
                       addconnection | deleteconnection |
                       setparameter | changeparameter) ]

Fig. 5. Excerpt of the vistrail schema.

An excerpt of the vistrail XML schema is shown in Figure 5. For simplicity of presentation, we only show a subset of the schema and use a notation less verbose than XML Schema. A vistrail has a unique ID, a name, an optional annotation, and a set of actions. Each action is uniquely identified by an ID (@id), the ID of its parent (@parentid), its creator (@user), and the time of creation (@date). The different actions we have implemented in our current prototype include: adding, deleting, and replacing dataflow modules; adding and deleting connections; and setting parameter values.
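The formal model above, in which a version is the composition x_n(... x_1(∅) ...), suggests a direct implementation: store only the actions (each pointing to its parent), and materialize a dataflow by replaying the chain of actions from the root. The sketch below is illustrative, with a simplified action vocabulary and an invented dataflow representation; it is not the VisTrails code.

```python
class Action:
    """One change action; parent_id links actions into the vistrail tree."""
    def __init__(self, action_id, parent_id, kind, **args):
        self.id, self.parent = action_id, parent_id
        self.kind, self.args = kind, args

    def apply(self, dataflow):
        # Shallow copy is enough here: each materialize() replays from scratch.
        df = {"modules": dict(dataflow["modules"])}
        if self.kind == "addModule":
            df["modules"][self.args["name"]] = dict(self.args["params"])
        elif self.kind == "deleteModule":
            del df["modules"][self.args["name"]]
        elif self.kind == "setParameter":
            df["modules"][self.args["name"]][self.args["param"]] = self.args["value"]
        return df

def materialize(actions, version_id):
    """Replay the chain of actions from the empty dataflow to a version."""
    by_id = {a.id: a for a in actions}
    chain, node = [], version_id
    while node is not None:
        chain.append(by_id[node])
        node = by_id[node].parent
    df = {"modules": {}}
    for action in reversed(chain):  # root-to-leaf order
        df = action.apply(df)
    return df
```

Because sibling versions share their common prefix of actions, branching the tree costs only the new actions; undo is simply materializing the parent version.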
To simplify the retrieval of particularly interesting versions, a node may have a name (the optional attribute tag in the schema) as well as additional notes from the creator (annotations). Note that the action-based provenance provides a very concise representation of the exploratory process. Figure 1 shows an example of the history captured during a single user's exploration of isosurface extraction techniques over several days. The user constructed several visualizations and tagged only the more important milestones in the exploration process. Although the complete vistrail contains 900 actions, it requires only 25 KB of disk space when compressed, far less space than the actual dataset being visualized, or even than the dataflow specifications of the tagged versions would require if saved separately.

V. DATA EXPLORATION THROUGH PIPELINE MANIPULATIONS

Capturing provenance by recording the changes applied to dataflows has benefits in both uniformity and compactness of representation. In addition, it allows powerful data exploration operations through direct manipulation of the history tree. Due to space limitations, we only describe a few of these operations. First, we describe our interface for interacting with provenance to gain a better understanding of the exploration process than a single visualization specification provides. Second, we show that

stored actions lend themselves very naturally to reuse through an intuitive macro mechanism. Third, we describe a bulk-update mechanism that allows the creation of a large number of visualizations covering an n-dimensional slice of the parameter space of a dataflow. Finally, we describe how the differences between dataflows can be efficiently computed in the action-based model.

A. Interacting with Provenance

Figures 1, 3(a), and 4 show different history trees. Each node in a history tree corresponds to one instance of a pipeline. An edge between two nodes represents an action that occurred between pipelines. To avoid visual clutter, in addition to a full tree view, we provide a default condensed view of the tree. The condensed view shows only the nodes that the user has deemed significant and assigned a tag. Since a tag is only metadata, it can easily be changed or even removed, thus changing the representation of the tree. A series of actions between tagged nodes is represented by three perpendicular lines that can be expanded to the full representation with a double click. As described above, additional metadata (e.g., annotations, creator, and date of creation) is recorded with each action and is displayed with the history tree. Annotations can be provided by the user to facilitate searching, clarify the exploration process, and provide the additional information that may be needed for a complete audit trail. When a node is selected, the metadata is displayed in the right panel of the history tree viewer, as shown in Figure 1 (left). In addition, the creator and date information are incorporated as visual cues in the history tree. The color of a node denotes its creator (blue nodes were created by the user and orange nodes were created by someone else), and the saturation of a node denotes chronology (more recent nodes are more saturated).
These cues, along with the tags and annotations, allow a user to open a new vistrail and quickly understand the visualization process. Inevitably, as a user creates and explores visualizations, mistakes will be made or the tree will grow too large. Thus, we provide pruning operations that hide extraneous nodes (no provenance information is actually deleted). Pruning the tree can currently be performed in two ways. The first, and simplest, pruning operation is to manually select the nodes to be removed and press the delete key. The second pruning operation is performed through a search and refine interface. A search highlights only the nodes that match a query, whereas a refine collapses the history tree to contain only the matching nodes. Both search and refine features use a keyword-based text input provided in the interface. Queries over modules, parameters, tags, annotations, users, and dates can be performed by directly typing words or by using a simple query interface. For example, a version created by stevec in March that uses a vtkRenderer could be queried as user:stevec date:mar vtkRenderer. These features are especially useful in a collaborative setting.

Fig. 6. A vistrail macro consists of a sequence of change actions that represent a fragment of a dataflow. Here we show an example of a series of actions performed on one version that can be applied to other, similar versions using a macro.

B. Reusing Stored Provenance

Visualization pipelines are complex and require deep knowledge of visualization techniques and libraries (e.g., VTK [27]). Even for experts, creating large pipelines is time-consuming. As complex visualization pipelines contain many common tasks, mechanisms that allow the re-use of pipelines or pipeline fragments are key to streamlining the visualization process.
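To make the action-based representation concrete, the following is a minimal sketch (all class and function names are hypothetical; the actual VisTrails schema differs) of the model these operations build on: a version of a pipeline is materialized by composing the change actions on its root-to-node path.

```python
from dataclasses import dataclass, field

# Sketch only: a pipeline is a set of modules and parameter settings,
# and each change action is a function from pipelines to pipelines.

@dataclass
class Pipeline:
    modules: dict = field(default_factory=dict)      # module id -> module name
    parameters: dict = field(default_factory=dict)   # (module id, name) -> value

def add_module(mod_id, name):
    def action(p):
        p.modules[mod_id] = name
        return p
    return action

def set_parameter(mod_id, pname, value):
    def action(p):
        p.parameters[(mod_id, pname)] = value
        return p
    return action

def materialize(actions):
    """Compose a root-to-node list of actions, starting from the empty pipeline."""
    p = Pipeline()
    for a in actions:
        p = a(p)
    return p

# A version defined by three actions on the path from the root:
version = [add_module(0, "vtkDataSetReader"),
           set_parameter(0, "FileName", "head.vtk"),
           add_module(1, "vtkRenderer")]
pipeline = materialize(version)
```

Because only the actions are stored, every operation below (macros, bulk updates, differences) is a manipulation of such action lists rather than of dataflow graphs.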
Figure 6 shows a simplified example of a vistrail that describes the steps for a pipeline that reads a dataset, applies a colormap, creates a scalar bar, and finally adds some text before rendering the images. Since the last three steps are needed for each time step, they can be made into a macro and re-used in all these pipelines. The action-based model leads to a natural representation for a macro: a macro is just a sequence of actions

x_j ∘ x_{j-1} ∘ ... ∘ x_i

Inserting a macro into a vistrail node d_i is also conceptually simple. The sequence of actions in the macro is composed with the actions in node d_i:

(x_j ∘ x_{j-1} ∘ ... ∘ x_i) ∘ d_i

There are several possible ways of defining a macro. The simplest is to do so directly on the history tree, by specifying a pair of versions d_j and d_i, where d_i < d_j,

and define the macro as the sequence of actions that takes d_i into d_j.

Fig. 7. Visualization Spreadsheet. Parameters for a vistrail loaded in a spreadsheet cell can be interactively modified by clicking on the cell. Cameras, as well as other vistrail parameters, can be synchronized across cells. This spreadsheet contains visualizations of MRI data of a head and was generated procedurally with the parameter exploration interface shown on the right. The horizontal axis varies one parameter of an image filter and the vertical axis explores another. The interface allows the user to manage the layout of cells when multiple visualizations are produced by one pipeline, as shown above with the two views (labeled 1 and 2).

The more common way of defining a macro is by interactively selecting a set of modules and connections in a given pipeline, so that these modules can be re-created in a different pipeline (i.e., copy and paste). Although a macro corresponds to a set of modules and connections in a dataflow, it is represented as a set of actions. Since dataflows are not stored directly, it is necessary to identify the actions in the vistrail that create and modify the selected modules and connections. However, between the first action x_i and the last action x_j, there might be actions that change parts of the pipeline that were not selected. These potentially irrelevant actions must be removed. In the current implementation, we perform a simple analysis and remove actions that are applied to modules that are not created between x_i and x_j. More complex, application-dependent analyses are possible. When a user defines a macro, some of the actions may create new dataflow modules and connections. Even though some of the connections will be between modules created inside the macro, some may connect to previously existing modules.
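The simple filtering analysis described above can be sketched as follows (the action encoding is hypothetical and connection actions are elided for brevity): actions are kept only if their target module is created inside the selected range.

```python
# Sketch of macro extraction: between the first and last selected actions
# there may be actions touching modules that were not selected; the simple
# analysis drops actions applied to modules not created inside the macro.

def extract_macro(actions, selected_modules):
    """Keep only actions whose target module is created inside the macro."""
    created = set()
    macro = []
    for kind, mod_id, payload in actions:
        if kind == "add_module":
            if mod_id in selected_modules:
                created.add(mod_id)
                macro.append((kind, mod_id, payload))
        elif mod_id in created:   # e.g. set_parameter on a macro-owned module
            macro.append((kind, mod_id, payload))
    return macro

actions = [("add_module", 0, "vtkColorTransferFunction"),
           ("add_module", 1, "vtkRenderer"),                 # not selected
           ("set_parameter", 1, ("Background", "white")),    # filtered out
           ("set_parameter", 0, ("ColorSpace", "HSV"))]
macro = extract_macro(actions, selected_modules={0})
```

In this sketch, applying the macro is simply appending its actions to another version's action list; a connection action with one endpoint outside `created` would be a dangling connection that must be re-matched when the macro is applied.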
Thus, mechanisms are needed to identify modules that match these dangling connections in a dataflow. To aid in the identification of matching modules when a macro is applied, we store, along with the macro, the external modules of the original dataflow from which the macro was extracted. We are currently investigating alternative strategies for automatically locating candidate matches to better guide the user in selecting the appropriate modules.

C. Scalable Derivation of Visualizations

The action-oriented model also leads to a very natural means to script dataflows. The parameter space of a dataflow d, denoted by P(d), is the set of dataflows that can be generated by changing the values of parameters of d. From the actions x_k, ..., x_1 that define d, we can derive P(d) by tracking addModule and deleteModule actions, and by knowing P(m), the parameter space of module m, for each module in the dataflow. Each parameter can then be thought of as a basis vector for the parameter space of d. Using the action-based model, instances for the basis vectors can be produced by a sequence of setParameter actions. Dataflows spanning an n-dimensional subspace of P(d) are generated as follows:

setParameter(id_n, value_n) ∘ ... ∘ setParameter(id_1, value_1) ∘ d

The ability to apply bulk updates over a set of distinct dataflows greatly simplifies the exploration of the parameter space for a given task and provides an effective means to create a large number of visualizations. Our bulk-update interface is accessed through a tab in the VisTrail Builder (see Figure 7). The parameters that have been set throughout the pipeline are displayed in the right column. All the parameters the user desires to explore are dragged into the main window. Our interface allows the user to map a parameter to one of four dimensions in the spreadsheet: horizontal rows, vertical columns, multiple sheets, and animations through time in a single cell. Additionally, there are several options for defining the values to explore. The

first is a simple interpolation between a start and an end value, most commonly used for simple integer or floating-point values. The second is a user-specified list that can be used for any data type. This option is intended for exploring strings such as file names or labels. Finally, the most flexible option provides an interface for creating a Python function that defines the values. This option is intended for providing non-linear sequences of values for exploration.

Mapping bulk updates to a spreadsheet is relatively straightforward when the pipeline produces a single visualization. However, this is not always the case, as shown by the different slice axes used in Figure 7. We provide an interface that allows the user to specify the spreadsheet layout of multiple visualizations during bulk updates. An annotated snapshot of the pipeline is displayed that shows a number on each module that appears multiple times in the pipeline. For example, in the figure, several modules are repeated twice and are thus assigned a 1 or a 2 to distinguish them in the explorations. The layout of the outputs is specified using a virtual cell interface that allows annotated cells to be stacked horizontally and vertically (shown on the bottom right of the figure). This gives the user the ability to manage multiple visualizations efficiently. Note that since parameter value modifications and dataflow modifications are captured uniformly by the action-based provenance, a similar mechanism can be used to explore spaces of different dataflow definitions.
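The bulk-update mechanism can be sketched as a cross product of setParameter actions composed with the base dataflow (hypothetical encoding; the first two value-generation options are shown, and the third would simply call a user-supplied Python function):

```python
import itertools

def interpolated(start, end, steps):
    """Option 1: linear interpolation between a start and an end value."""
    return [start + (end - start) * i / (steps - 1) for i in range(steps)]

def explore(base_actions, explored):
    """explored: list of (parameter id, values); yields one dataflow variant
    per point of the n-dimensional slice of the parameter space."""
    dims = [[("set_parameter", pid, v) for v in values]
            for pid, values in explored]
    for combo in itertools.product(*dims):
        yield base_actions + list(combo)

base = [("add_module", 0, "vtkImageGaussianSmooth")]
variants = list(explore(base, [
    ("StandardDeviation", interpolated(1.0, 2.0, 3)),  # option 1: interpolation
    ("Dimensionality", [2, 3]),                        # option 2: explicit list
]))
# 3 x 2 = 6 dataflows, one per spreadsheet cell (rows x columns).
```

Each axis of the product maps to one of the four spreadsheet dimensions, and the actions common to all variants (here, the base dataflow) need only be executed once.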
For example, if a scientist needs to compare visualizations generated by different algorithms, she can create a series of visualizations using the first algorithm, and then apply a bulk update that creates new dataflows by replacing the occurrences of the first algorithm with the second while keeping all the other modules and parameter settings intact. Together with the visualization spreadsheet provided by VisTrails, the parameter exploration mechanism allows users to effectively compare different visualizations. In addition, since VisTrails identifies and avoids redundant operations, dataflows generated from parameter explorations can be executed efficiently: the operations that are common to the set of affected dataflows need only be executed once.

Fig. 8. Computing the differences between two dataflows that derive visualizations of CT data of a lung containing pathological tissue. In VisTrails, users can select nodes (dataflows) in the vistrail tree to be compared. Here, the user selected the dataflows labeled Raycasted and Texture Mapped. Their corresponding visualizations are shown on the right and the differences between the two dataflows are shown at the bottom. Modules shown in blue are present only in Texture Mapped, orange modules are present only in Raycasted, dark-grey modules are present in both dataflows, and light-grey modules have different parameter values in the two versions, as shown on the bottom left for the vtkColorTransferFunction module.

D. Computing Pipeline Differences

The visualization discovery process requires many trial-and-error steps, and it is not uncommon for tens to hundreds of different pipelines to be generated in the course of a study. In such a scenario, it becomes important to understand what the differences between pipelines are, especially if multiple people are collaboratively exploring the data. Computing the differences between two pipelines by considering their underlying graph structure turns out to be impractical.
In fact, the related decision problem of subgraph isomorphism (or matching) is known to be NP-complete [13]. This problem becomes much easier using the action-based provenance model. Let d_1 and d_2 be two vistrail nodes corresponding to the pipelines P_1 and P_2, respectively:

d_1 = x_k ∘ x_{k-1} ∘ ... ∘ x_1 ∘ ∅
d_2 = y_i ∘ y_{i-1} ∘ ... ∘ y_1 ∘ ∅

Since all vistrail nodes are represented by paths in a rooted tree, they all share at least one common ancestor. Let c = x_m = y_j = lca(x_k, y_i) be the least common ancestor of x_k and y_i. The common part of d_1 and d_2 is the path from the root of the tree down to node c, i.e., {c, ..., ∅}. The difference computation is simple and efficient; it can be done in linear time. After obtaining the least common ancestor of the nodes, the difference d_2 - d_1 is represented by two lists of actions: {x_k, ..., x_{m+1}} and {y_i, ..., y_{j+1}}. Using these lists, we identify the modules that are unique to each of the two pipelines, and for modules that are present in both pipelines, we identify

their differences by examining setParameter actions. For display, we divide the nodes into four categories: shared nodes, shared nodes with parameter changes, nodes that belong only to P_1, and nodes that belong only to P_2. Edges are divided into shared edges, and edges that belong to P_1 or P_2, respectively. A remaining issue is how to handle nodes that have been deleted and re-inserted into a given dataflow. We handle this case by inspecting module names and function signatures. Figure 8 shows the visual difference interface provided by VisTrails. A visual difference is enacted by dragging one node in the history tree onto another, which opens a new window with a difference pipeline. The difference pipeline displays modules unique to the first node in orange, modules unique to the second node in blue, modules that are the same in dark grey, and modules that have different parameter values in light grey. The parameter changes are displayed in a subwindow when a light-grey module is selected. Using this interface, users can easily correlate differences between two visualizations with differences between their corresponding specifications.

VI. EVALUATION

Source code, binaries, and documentation for VisTrails are available at the project web page. In only a few months, the system has been downloaded over 400 times and is currently being used in several scientific settings. To evaluate the utility of VisTrails in these settings, we performed an informal expert review. Expert reviews have been shown to be a useful means of evaluation, and require fewer reviewers than standard user studies [30]. We collected comments and suggestions from four different users who have been using VisTrails in their research. A summary of the projects and comments is provided below.

Review 1: Ocean Simulations.
VisTrails is currently being used to jump-start 3D visualization capabilities for large-scale ocean circulation simulations in the CORIE project. According to our reviewer, oceanographers traditionally eschew 3D visualizations due to the difficulty of tuning and interpreting the increased number of parameters that result from the added dimensionality. The parameter exploration capabilities of VisTrails are being used to overcome this hurdle. This functionality is used not only to adjust characteristics of the visualizations, but also to select the data to be visualized. According to the reviewer, this technique allows VisTrails to "transcend its role as a visualization tool and become a query interface to our data repository." In addition, the provenance and execution mechanism in VisTrails is being used for a custom querying interface for accessing and manipulating simulation results. "With VisTrails, these query patterns can be developed, tweaked, stored, and reused flexibly: they can be called with the convenience of a function, but inspected visually as the algebraic expression that they are." Prior to VisTrails, the group used scripts to create visualizations and had no means of managing the creation and maintenance process.

Review 2: Mesh Quality Comparisons. Our second reviewer uses VisTrails to perform comparative analysis of isosurface extraction algorithms (see Figure 1). These comparisons typically involve images of the resulting mesh along with several histograms that display quality information about the mesh. At least three unique software packages are used just to create the different visualizations. This task requires frequent workflow modifications and substantial parameter tuning during the analysis process. Before VisTrails, the reviewer managed the data and software manually with scripts.
According to the reviewer, the fact that VisTrails has a supportive provenance system "allows me to explore different alternatives with more freedom, since I know that the different alternatives that I pursued before can be easily recovered." He also had positive feedback about the interface: "navigation is easy and querying or exploring different pipelines is done in a nice and intuitive way." As an area of improvement, the reviewer suggested that the system should include a more automatic way of clearing unwanted branches in the version tree. Overall, the reviewer was very positive and mentioned that he intends to use VisTrails in future projects.

Review 3: Experimental High Energy Physics. VisTrails is currently being used to develop a workflow management system for Experimental High Energy Physics (HEP) at a well-known research institution. In particular, the pipeline builder and provenance capturing mechanism are used extensively. Due to the flexibility of the system, it was easily adapted to include new modules, change the execution model, and add a custom scientific notebook for the results. The new execution model allows for asynchronous processing of the workflow with incremental updates of the results in order to give fast feedback to the physicist. The custom modules correspond to fine-grain programming structures (i.e., for loops and if statements). This fine granularity allows the use of VisTrails provenance tracking to record all the filtering changes typically done during an HEP analysis. The reviewer commented:

"I feel tracking of provenance is an essential part of scientific research and the HEP field does not have any tools to aid with provenance tracking. However, to me provenance tracking is about what you did, while what VisTrails does is even better, which is replay what you did and use that as a starting point for new changes. It is the latter ability which enticed me to use VisTrails." The reviewer also made some interesting suggestions for future versions of VisTrails: providing a mechanism for versioning external packages; allowing the user to assign importance to named versions in the history tree; and providing conventional undo/redo options during pipeline creation. The reviewer also noted that he evaluated other systems before VisTrails, but quickly abandoned them due to their lack of provenance tracking.

Review 4: Memory Studies. VisTrails is being used as a primary tool to study the effect of Repetitive Transcranial Magnetic Stimulation on the working memory centers of the brain. The study manages and analyzes MRI, EEG, and MEG data for many test subjects (see Figure 9). As the amount of data grows, it is increasingly important to track changes to the analysis methodologies as well as to the resulting visualizations. VisTrails provides a seamless method for doing this, as any change to the data analysis pipeline is automatically recorded. According to the reviewer: "provenance in general is not something that may be taken away from the scientific process. Without reproducible results, there is no science." As feedback, the reviewer suggested that the search and refine features of the history tree are useful, but that the system should allow the user to hide extraneous nodes in the history tree as their importance changes. The reviewer also suggested that the ability to create custom widgets (e.g., a transfer function editor) would be useful for some applications.
The reviewers provided important feedback and suggestions that will be used to improve the system. These suggestions are currently being incorporated into VisTrails and will be included in future releases of the software. In fact, some of the pruning functionality described in Section V was recently added to address reviewers' concerns about removing unwanted nodes in the history tree.

Fig. 9. An example of the history tree, pipeline, and spreadsheet used for visualizing measured data for a single patient in a memory study. This pipeline uses the VisTrails execution model to combine the VTK, ITK, and Matplotlib libraries.

VII. CONCLUSION

In this paper, we propose a new infrastructure for streamlining data exploration through visualization. The infrastructure leverages a new action-based provenance mechanism to provide useful features that allow effective exploration of visualizations over large parameter spaces. The provenance information captured by VisTrails can be used to augment existing scientific data repositories with the process that scientists go through to generate and analyze data. For example, a scientist can publish the vistrail used to generate the images in a paper. This has the obvious benefit of allowing scientific results to be reproduced. In addition, since substantial expert knowledge is involved in the exploratory process, capturing this information creates the opportunity to develop mining techniques that extract knowledge, in the form of exploratory patterns, which can be used to solve an array of problems. We are currently applying VisTrails to a number of different problem areas, from environmental sciences and medical imaging to computer-aided drug design and discovery. Our initial experiences have confirmed that the VisTrails data exploration infrastructure greatly simplifies the scientific discovery process, and that it indeed allows scientists to more effectively explore vast amounts of data.
This indicates that appropriate management of both the data and the processes involved in visualization has the potential to substantially increase the impact of visualization in scientific discovery. Although VisTrails was originally designed to support data exploration through visualization, ideas developed in the context of VisTrails have been successfully applied to scientific workflow systems in other domains. For example, algorithms developed for VisTrails have also been used in Kepler, a general scientific workflow system [24]; and Taverna, a workflow workbench developed as part of the UK's myGrid project, has recently added support for tracking the evolution path of workflows [37].


More information

3.4 Data-Centric workflow

3.4 Data-Centric workflow 3.4 Data-Centric workflow One of the most important activities in a S-DWH environment is represented by data integration of different and heterogeneous sources. The process of extract, transform, and load

More information

QDA Miner. Addendum v2.0

QDA Miner. Addendum v2.0 QDA Miner Addendum v2.0 QDA Miner is an easy-to-use qualitative analysis software for coding, annotating, retrieving and reviewing coded data and documents such as open-ended responses, customer comments,

More information

6.001 Notes: Section 6.1

6.001 Notes: Section 6.1 6.001 Notes: Section 6.1 Slide 6.1.1 When we first starting talking about Scheme expressions, you may recall we said that (almost) every Scheme expression had three components, a syntax (legal ways of

More information

Breeding Guide. Customer Services PHENOME-NETWORKS 4Ben Gurion Street, 74032, Nes-Ziona, Israel

Breeding Guide. Customer Services PHENOME-NETWORKS 4Ben Gurion Street, 74032, Nes-Ziona, Israel Breeding Guide Customer Services PHENOME-NETWORKS 4Ben Gurion Street, 74032, Nes-Ziona, Israel www.phenome-netwoks.com Contents PHENOME ONE - INTRODUCTION... 3 THE PHENOME ONE LAYOUT... 4 THE JOBS ICON...

More information

Education Services: Maximize the value of your data management investment with E-WorkBook Web E-WorkBook Web

Education Services: Maximize the value of your data management investment with E-WorkBook Web E-WorkBook Web Education Services: Maximize the value of your data management investment with E-WorkBook Web E-WorkBook Web (Super User 3 Days) IDBS Education Services provides a range of training options to help customers

More information

High Performance Computing Prof. Matthew Jacob Department of Computer Science and Automation Indian Institute of Science, Bangalore

High Performance Computing Prof. Matthew Jacob Department of Computer Science and Automation Indian Institute of Science, Bangalore High Performance Computing Prof. Matthew Jacob Department of Computer Science and Automation Indian Institute of Science, Bangalore Module No # 09 Lecture No # 40 This is lecture forty of the course on

More information

Developing SQL Databases

Developing SQL Databases Course 20762B: Developing SQL Databases Page 1 of 9 Developing SQL Databases Course 20762B: 4 days; Instructor-Led Introduction This four-day instructor-led course provides students with the knowledge

More information

Utilizing a Common Language as a Generative Software Reuse Tool

Utilizing a Common Language as a Generative Software Reuse Tool Utilizing a Common Language as a Generative Software Reuse Tool Chris Henry and Stanislaw Jarzabek Department of Computer Science School of Computing, National University of Singapore 3 Science Drive,

More information

Reference Requirements for Records and Documents Management

Reference Requirements for Records and Documents Management Reference Requirements for Records and Documents Management Ricardo Jorge Seno Martins ricardosenomartins@gmail.com Instituto Superior Técnico, Lisboa, Portugal May 2015 Abstract When information systems

More information

This Document is licensed to

This Document is licensed to GAMP 5 Page 291 End User Applications Including Spreadsheets 1 Introduction This appendix gives guidance on the use of end user applications such as spreadsheets or small databases in a GxP environment.

More information

Creating a Pivot Table

Creating a Pivot Table Contents Introduction... 1 Creating a Pivot Table... 1 A One-Dimensional Table... 2 A Two-Dimensional Table... 4 A Three-Dimensional Table... 5 Hiding and Showing Summary Values... 5 Adding New Data and

More information

Kepler/pPOD: Scientific Workflow and Provenance Support for Assembling the Tree of Life

Kepler/pPOD: Scientific Workflow and Provenance Support for Assembling the Tree of Life Kepler/pPOD: Scientific Workflow and Provenance Support for Assembling the Tree of Life Shawn Bowers 1, Timothy McPhillips 1, Sean Riddle 1, Manish Anand 2, Bertram Ludäscher 1,2 1 UC Davis Genome Center,

More information

Scientific Graphing in Excel 2013

Scientific Graphing in Excel 2013 Scientific Graphing in Excel 2013 When you start Excel, you will see the screen below. Various parts of the display are labelled in red, with arrows, to define the terms used in the remainder of this overview.

More information

AUTOMATIC GRAPHIC USER INTERFACE GENERATION FOR VTK

AUTOMATIC GRAPHIC USER INTERFACE GENERATION FOR VTK AUTOMATIC GRAPHIC USER INTERFACE GENERATION FOR VTK Wilfrid Lefer LIUPPA - Université de Pau B.P. 1155, 64013 Pau, France e-mail: wilfrid.lefer@univ-pau.fr ABSTRACT VTK (The Visualization Toolkit) has

More information

The Solution Your Legal Department Has Been Looking For

The Solution Your Legal Department Has Been Looking For Blog Post The Solution Your Legal Department Has Been Looking For Microsoft Outlook to be my DM Desktop In my experience every Legal Department has users who previously worked at a law firm, where they

More information

Computer Science Lab Exercise 1

Computer Science Lab Exercise 1 1 of 10 Computer Science 127 - Lab Exercise 1 Introduction to Excel User-Defined Functions (pdf) During this lab you will experiment with creating Excel user-defined functions (UDFs). Background We use

More information

Re-using Data Mining Workflows

Re-using Data Mining Workflows Re-using Data Mining Workflows Stefan Rüping, Dennis Wegener, and Philipp Bremer Fraunhofer IAIS, Schloss Birlinghoven, 53754 Sankt Augustin, Germany http://www.iais.fraunhofer.de Abstract. Setting up

More information

DesignMinders: A Design Knowledge Collaboration Approach

DesignMinders: A Design Knowledge Collaboration Approach DesignMinders: A Design Knowledge Collaboration Approach Gerald Bortis and André van der Hoek University of California, Irvine Department of Informatics Irvine, CA 92697-3440 {gbortis, andre}@ics.uci.edu

More information

An Oracle White Paper April Oracle Application Express 5.0 Overview

An Oracle White Paper April Oracle Application Express 5.0 Overview An Oracle White Paper April 2015 Oracle Application Express 5.0 Overview Disclaimer The following is intended to outline our general product direction. It is intended for information purposes only, and

More information

Automated Road Segment Creation Process

Automated Road Segment Creation Process David A. Noyce, PhD, PE Director and Chair Traffic Operations and Safety Laboratory Civil and Environmental Engineering Department Kelvin R. Santiago, MS, PE Assistant Researcher Traffic Operations and

More information

Last updated: 17 March Steven Chalmers

Last updated: 17 March Steven Chalmers Last updated: 17 March 2015 Steven Chalmers PORTFOLIO Introduction I excel at Interaction Design. I am versed in Tufte and Wroblewski rather than color palettes and font families. The samples of work I

More information

Relationships and Traceability in PTC Integrity Lifecycle Manager

Relationships and Traceability in PTC Integrity Lifecycle Manager Relationships and Traceability in PTC Integrity Lifecycle Manager Author: Scott Milton 1 P age Table of Contents 1. Abstract... 3 2. Introduction... 4 3. Workflows and Documents Relationship Fields...

More information

Introduction to IRQA 4

Introduction to IRQA 4 Introduction to IRQA 4 Main functionality and use Marcel Overeem 1/7/2011 Marcel Overeem is consultant at SpeedSoft BV and has written this document to provide a short overview of the main functionality

More information

Microsoft Developing SQL Databases

Microsoft Developing SQL Databases 1800 ULEARN (853 276) www.ddls.com.au Length 5 days Microsoft 20762 - Developing SQL Databases Price $4290.00 (inc GST) Version C Overview This five-day instructor-led course provides students with the

More information

Database Developers Forum APEX

Database Developers Forum APEX Database Developers Forum APEX 20.05.2014 Antonio Romero Marin, Aurelien Fernandes, Jose Rolland Lopez De Coca, Nikolay Tsvetkov, Zereyakob Makonnen, Zory Zaharieva BE-CO Contents Introduction to the Controls

More information

Dremel Digilab 3D Slicer Software

Dremel Digilab 3D Slicer Software Dremel Digilab 3D Slicer Software Dremel Digilab 3D Slicer prepares your model for 3D printing. For novices, it makes it easy to get great results. For experts, there are over 200 settings to adjust to

More information

Chapter Twelve. Systems Design and Development

Chapter Twelve. Systems Design and Development Chapter Twelve Systems Design and Development After reading this chapter, you should be able to: Describe the process of designing, programming, and debugging a computer program Explain why there are many

More information

Teiid Designer User Guide 7.5.0

Teiid Designer User Guide 7.5.0 Teiid Designer User Guide 1 7.5.0 1. Introduction... 1 1.1. What is Teiid Designer?... 1 1.2. Why Use Teiid Designer?... 2 1.3. Metadata Overview... 2 1.3.1. What is Metadata... 2 1.3.2. Editing Metadata

More information

Technology for Merchandise Planning and Control

Technology for Merchandise Planning and Control Technology for Merchandise Planning and Control Contents: Module Three: Formatting Worksheets Working with Charts UREFERENCE/PAGES Formatting Worksheets... Unit C Formatting Values... Excel 52 Excel 57

More information

Using ESML in a Semantic Web Approach for Improved Earth Science Data Usability

Using ESML in a Semantic Web Approach for Improved Earth Science Data Usability Using in a Semantic Web Approach for Improved Earth Science Data Usability Rahul Ramachandran, Helen Conover, Sunil Movva and Sara Graves Information Technology and Systems Center University of Alabama

More information

Oracle Warehouse Builder 10g Release 2 Integrating Packaged Applications Data

Oracle Warehouse Builder 10g Release 2 Integrating Packaged Applications Data Oracle Warehouse Builder 10g Release 2 Integrating Packaged Applications Data June 2006 Note: This document is for informational purposes. It is not a commitment to deliver any material, code, or functionality,

More information

CHAPTER 7 CONCLUSION AND FUTURE WORK

CHAPTER 7 CONCLUSION AND FUTURE WORK CHAPTER 7 CONCLUSION AND FUTURE WORK 7.1 Conclusion Data pre-processing is very important in data mining process. Certain data cleaning techniques usually are not applicable to all kinds of data. Deduplication

More information

Designed by Jason Wagner, Course Web Programmer, Office of e-learning NOTE ABOUT CELL REFERENCES IN THIS DOCUMENT... 1

Designed by Jason Wagner, Course Web Programmer, Office of e-learning NOTE ABOUT CELL REFERENCES IN THIS DOCUMENT... 1 Excel Essentials Designed by Jason Wagner, Course Web Programmer, Office of e-learning NOTE ABOUT CELL REFERENCES IN THIS DOCUMENT... 1 FREQUENTLY USED KEYBOARD SHORTCUTS... 1 FORMATTING CELLS WITH PRESET

More information

Databricks Delta: Bringing Unprecedented Reliability and Performance to Cloud Data Lakes

Databricks Delta: Bringing Unprecedented Reliability and Performance to Cloud Data Lakes Databricks Delta: Bringing Unprecedented Reliability and Performance to Cloud Data Lakes AN UNDER THE HOOD LOOK Databricks Delta, a component of the Databricks Unified Analytics Platform*, is a unified

More information

SILVACO. An Intuitive Front-End to Effective and Efficient Schematic Capture Design INSIDE. Introduction. Concepts of Scholar Schematic Capture

SILVACO. An Intuitive Front-End to Effective and Efficient Schematic Capture Design INSIDE. Introduction. Concepts of Scholar Schematic Capture TCAD Driven CAD A Journal for CAD/CAE Engineers Introduction In our previous publication ("Scholar: An Enhanced Multi-Platform Schematic Capture", Simulation Standard, Vol.10, Number 9, September 1999)

More information

20762B: DEVELOPING SQL DATABASES

20762B: DEVELOPING SQL DATABASES ABOUT THIS COURSE This five day instructor-led course provides students with the knowledge and skills to develop a Microsoft SQL Server 2016 database. The course focuses on teaching individuals how to

More information

Joint Entity Resolution

Joint Entity Resolution Joint Entity Resolution Steven Euijong Whang, Hector Garcia-Molina Computer Science Department, Stanford University 353 Serra Mall, Stanford, CA 94305, USA {swhang, hector}@cs.stanford.edu No Institute

More information

CA Productivity Accelerator 12.1 and Later

CA Productivity Accelerator 12.1 and Later CA Productivity Accelerator 12.1 and Later Localize Content Localize Content Once you have created content in one language, you might want to translate it into one or more different languages. The Developer

More information

Heuristic Evaluation of Covalence

Heuristic Evaluation of Covalence Heuristic Evaluation of Covalence Evaluator #A: Selina Her Evaluator #B: Ben-han Sung Evaluator #C: Giordano Jacuzzi 1. Problem Covalence is a concept-mapping tool that links images, text, and ideas to

More information

Using JBI for Service-Oriented Integration (SOI)

Using JBI for Service-Oriented Integration (SOI) Using JBI for -Oriented Integration (SOI) Ron Ten-Hove, Sun Microsystems January 27, 2006 2006, Sun Microsystems Inc. Introduction How do you use a service-oriented architecture (SOA)? This is an important

More information

6.001 Notes: Section 15.1

6.001 Notes: Section 15.1 6.001 Notes: Section 15.1 Slide 15.1.1 Our goal over the next few lectures is to build an interpreter, which in a very basic sense is the ultimate in programming, since doing so will allow us to define

More information

Scientific Graphing in Excel 2007

Scientific Graphing in Excel 2007 Scientific Graphing in Excel 2007 When you start Excel, you will see the screen below. Various parts of the display are labelled in red, with arrows, to define the terms used in the remainder of this overview.

More information

ZENworks Reporting System Reference. January 2017

ZENworks Reporting System Reference. January 2017 ZENworks Reporting System Reference January 2017 Legal Notices For information about legal notices, trademarks, disclaimers, warranties, export and other use restrictions, U.S. Government rights, patent

More information

B.H.GARDI COLLEGE OF MASTER OF COMPUTER APPLICATION. Ch. 1 :- Introduction Database Management System - 1

B.H.GARDI COLLEGE OF MASTER OF COMPUTER APPLICATION. Ch. 1 :- Introduction Database Management System - 1 Basic Concepts :- 1. What is Data? Data is a collection of facts from which conclusion may be drawn. In computer science, data is anything in a form suitable for use with a computer. Data is often distinguished

More information

Multicore Computing and Scientific Discovery

Multicore Computing and Scientific Discovery scientific infrastructure Multicore Computing and Scientific Discovery James Larus Dennis Gannon Microsoft Research In the past half century, parallel computers, parallel computation, and scientific research

More information

Comprehensive Guide to Evaluating Event Stream Processing Engines

Comprehensive Guide to Evaluating Event Stream Processing Engines Comprehensive Guide to Evaluating Event Stream Processing Engines i Copyright 2006 Coral8, Inc. All rights reserved worldwide. Worldwide Headquarters: Coral8, Inc. 82 Pioneer Way, Suite 106 Mountain View,

More information

PHP Composer 9 Benefits of Using a Binary Repository Manager

PHP Composer 9 Benefits of Using a Binary Repository Manager PHP Composer 9 Benefits of Using a Binary Repository Manager White Paper Copyright 2017 JFrog Ltd. March 2017 www.jfrog.com Executive Summary PHP development has become one of the most popular platforms

More information

A Guide to CMS Functions

A Guide to CMS Functions 2017-02-13 Orckestra, Europe Nygårdsvej 16 DK-2100 Copenhagen Phone +45 3915 7600 www.orckestra.com Contents 1 INTRODUCTION... 3 1.1 Who Should Read This Guide 3 1.2 What You Will Learn 3 2 WHAT IS A CMS

More information

Eloqua Insight Intro Analyzer User Guide

Eloqua Insight Intro Analyzer User Guide Eloqua Insight Intro Analyzer User Guide Table of Contents About the Course Materials... 4 Introduction to Eloqua Insight for Analyzer Users... 13 Introduction to Eloqua Insight... 13 Eloqua Insight Home

More information

Visualizing the Life and Anatomy of Dark Matter

Visualizing the Life and Anatomy of Dark Matter Visualizing the Life and Anatomy of Dark Matter Subhashis Hazarika Tzu-Hsuan Wei Rajaditya Mukherjee Alexandru Barbur ABSTRACT In this paper we provide a visualization based answer to understanding the

More information

Using Provenance to Support Real-Time Collaborative Design of Workflows

Using Provenance to Support Real-Time Collaborative Design of Workflows Using Provenance to Support Real-Time Collaborative Design of Workflows Tommy Ellkvist, David Koop 2, Erik W. Anderson 2, Juliana Freire 2, and Cláudio Silva 2 Linköpings universitet, Linköping, Sweden

More information

Course 40045A: Microsoft SQL Server for Oracle DBAs

Course 40045A: Microsoft SQL Server for Oracle DBAs Skip to main content Course 40045A: Microsoft SQL Server for Oracle DBAs - Course details Course Outline Module 1: Database and Instance This module provides an understanding of the two major components

More information

MutanT: A Modular and Generic Tool for Multi-Sensor Data Processing

MutanT: A Modular and Generic Tool for Multi-Sensor Data Processing 12th International Conference on Information Fusion Seattle, WA, USA, July 6-9, 2009 MutanT: A Modular and Generic Tool for Multi-Sensor Data Processing Simon Hawe, Ulrich Kirchmaier, Klaus Diepold Lehrstuhl

More information

3Lesson 3: Web Project Management Fundamentals Objectives

3Lesson 3: Web Project Management Fundamentals Objectives 3Lesson 3: Web Project Management Fundamentals Objectives By the end of this lesson, you will be able to: 1.1.11: Determine site project implementation factors (includes stakeholder input, time frame,

More information

1. WELDMANAGEMENT PRODUCT

1. WELDMANAGEMENT PRODUCT Table of Contents WeldManagement Product.................................. 3 Workflow Overview........................................ 4 Weld Types.............................................. 5 Weld

More information

INTRO TO EXCEL 2007 TOPICS COVERED. Department of Technology Enhanced Learning Information Technology Systems Division. What s New in Excel

INTRO TO EXCEL 2007 TOPICS COVERED. Department of Technology Enhanced Learning Information Technology Systems Division. What s New in Excel Information Technology Systems Division What s New in Excel 2007... 2 Creating Workbooks... 6 Modifying Workbooks... 7 Entering and Revising Data... 10 Formatting Cells... 11 TOPICS COVERED Formulas...

More information

SEARCH AND INFERENCE WITH DIAGRAMS

SEARCH AND INFERENCE WITH DIAGRAMS SEARCH AND INFERENCE WITH DIAGRAMS Michael Wollowski Rose-Hulman Institute of Technology 5500 Wabash Ave., Terre Haute, IN 47803 USA wollowski@rose-hulman.edu ABSTRACT We developed a process for presenting

More information

MICROSOFT BUSINESS INTELLIGENCE

MICROSOFT BUSINESS INTELLIGENCE SSIS MICROSOFT BUSINESS INTELLIGENCE 1) Introduction to Integration Services Defining sql server integration services Exploring the need for migrating diverse Data the role of business intelligence (bi)

More information

Warewolf User Guide 1: Introduction and Basic Concepts

Warewolf User Guide 1: Introduction and Basic Concepts Warewolf User Guide 1: Introduction and Basic Concepts Contents: An Introduction to Warewolf Preparation for the Course Welcome to Warewolf Studio Create your first Microservice Exercise 1 Using the Explorer

More information

One-PROC-Away: The Essence of an Analysis Database Russell W. Helms, Ph.D. Rho, Inc.

One-PROC-Away: The Essence of an Analysis Database Russell W. Helms, Ph.D. Rho, Inc. One-PROC-Away: The Essence of an Analysis Database Russell W. Helms, Ph.D. Rho, Inc. Chapel Hill, NC RHelms@RhoWorld.com www.rhoworld.com Presented to ASA/JSM: San Francisco, August 2003 One-PROC-Away

More information

Lo-Fidelity Prototype Report

Lo-Fidelity Prototype Report Lo-Fidelity Prototype Report Introduction A room scheduling system, at the core, is very simple. However, features and expansions that make it more appealing to users greatly increase the possibility for

More information