I ve Seen Enough : Incrementally Improving Visualizations to Support Rapid Decision Making

Size: px

Start display at page:

Download "I ve Seen Enough : Incrementally Improving Visualizations to Support Rapid Decision Making"

Egbert Malone
5 years ago
Views:

I ve Seen Enough : Increentally Iproving Visualizations to Support Rapid Decision Making Sajjadur Rahan Marya Aliakbarpour 2 Ha Kyung Kong Eric Blais 3 Karrie Karahalios,4 Aditya Paraeswaran Ronitt

1 I ve Seen Enough : Increentally Iproving Visualizations to Support Rapid Decision Making Sajjadur Rahan Marya Aliakbarpour 2 Ha Kyung Kong Eric Blais 3 Karrie Karahalios,4 Aditya Paraeswaran Ronitt Rubinfield 2 University of Illinois (UIUC) 4 Adobe Research 2 MIT 3 University of Waterloo {srahan7,hkong6,kkarahal,adityagp}@illinois.edu {aryaa@,ronitt@csail}.it.edu eblais@uwaterloo.ca ABSTRACT Data visualization is an effective echanis for identifying trends, insights, and anoalies in data. On large datasets, however, generating visualizations can take a long tie, delaying the extraction of insights, hapering decision aking, and reducing exploration tie. One solution is to use online sapling-based schees to generate visualizations faster while iproving the displayed estiates increentally, eventually converging to the exact visualization coputed on the entire data. However, the interediate visualizations are approxiate, and often fluctuate drastically, leading to potentially incorrect decisions. We propose sapling-based increental visualization algoriths that reveal the salient features of the visualization quickly with a 46 speedup relative to baselines while iniizing error, thus enabling rapid and errorfree decision aking. We deonstrate that these algoriths are optial in ters of saple coplexity, in that given the level of interactivity, they generate approxiations that take as few saples as possible. We have developed the algoriths in the context of an increental visualization tool, titled INCVISAGE, for trendline and heatap visualizations. We evaluate the usability of INCVIS- AGE via user studies and deonstrate that users are able to ake effective decisions with increentally iproving visualizations, especially copared to vanilla online-sapling based schees.. INTRODUCTION Visualization is eerging as the ost coon echanis for exploring and extracting value fro datasets by novice and expert analysts alike. This is evidenced by the huge popularity of data visualization tools such as Microsoft s PowerBI and Tableau, both of which have 00s of illions of dollars in revenue just this year [5, 3]. And yet data visualization on increasingly large datasets, reains cubersoe: when datasets are large, generating visualizations can take hours, ipeding interaction, preventing exploration, and delaying the extraction of insights [36]. One approach to generating visualizations faster is to saple a sall nuber of datapoints fro the dataset online; by using sapling, we can view visualizations that increentally iprove over tie and eventually converge to the visualization coputed on the entire data. However, such interediate visualizations are approxiate, and often fluctuate drastically, leading to incorrect insights and conclusions. The key question we wish to address in this paper is the following: can we develop a sapling-based increental visualization algorith that reveals the features of the eventual visualization quickly, but does so in a anner that is guaranteed to be correct? Illustrative Exaple. We describe the goals of our sapling algoriths via a pictorial exaple. In Figure, we depict, in the first row, the variation of present sapling algoriths as tie progresses and ore saples are taken: at t, t 2, t 4, t 7, and when all of the data has been sapled. This is, for exaple, what visualizing the results of an online-aggregation-like [9] sapling algorith ight provide. If a user sees the visualization at any of the interediate tie points, they ay ake incorrect decisions. For exaple, at tie t, the user ay reach an incorrect conclusion that the values at the start and the end are lower than ost of the trend, while in fact, the opposite is true this anoaly is due to the skewed saples that were drawn to reach t. The visualization continues to vary at t 2, t 4, and t 7, with values fluctuating randoly based on the saples that were drawn. Indeed, a user ay end up having to wait until the values stabilize, and even then ay not be able to fully trust the results. One approach to aeliorate this issue would be to use confidence intervals to guide users in deciding when to draw conclusions however, prior work has deonstrated that users are not able to interpret confidence intervals correctly [4]. At the sae tie, the users are subject to the sae vagaries of the saples that were drawn. Figure : INCVISAGE exaple. Another approach, titled INCVISAGE, that we espouse in this paper and depict in the second row is the following: at each tie point t i, reveal one additional segent for a i-segent trendline, by splitting one of the segents for the trendline at t i, when the tool is confident enough to do so. Thus, INCVISAGE is very conservative at t and just provides a ean value for the entire range, then at t 2, it splits the single segent into two segents, indicating that the trend increases towards the end. Overall, by t 7, the tool has indicated any of the iportant features of the trend: it starts off high, has a bup in the iddle, and then increases towards the end. This approach reveals features of the eventual visualization in the order of proinence, allowing users to gain early insights and draw conclusions early. This approach is reiniscent of interlaced pixelbased iage generation in browsers [38], where the iage slowly goes fro being blurry to sharp over the course of the rendering, displaying the ost salient features of the visualization before the less iportant features. Siilar ideas have also been developed in other doains such as signal processing [47] and geo aps [3]. So far, we ve described trendlines; the INCVISAGE approach can be applied to heatap visualizations as well depicted in row 4 for the corresponding online-aggregation-like approach shown in row 3 as is typical in heataps, the higher the value, the darker the

2 color. Once again row 3 the status quo fluctuates treendously, not letting analysts draw eaningful insights early and confidently. On the other hand, row 4 the INCVISAGE approach repeatedly subdivides a block into four blocks when it is confident enough to do so, ephasizing early, that the right hand top corner has a higher value, while the values right below it are soewhat lower. In fact, the interediate visualizations ay be preferable because users can get the big picture view without being influenced by noise. Open Questions. Naturally, developing INCVISAGE brings a whole host of open questions, that span the spectru fro usability to algorithic process. First, it is not clear at what rate we should be displaying the results of the increental visualization algorith. When can we be sure that we know enough to show the ith increent, given that the (i )th increent has been shown already? How should the ith increent differ fro the (i )th increent? How do we prioritize sapling to ensure that we get to the ith increent as soon as possible, but with guarantees? Can we show that our algorith is in a sense optial, in that it ais to take as few saples as possible to display the ith increent with guarantees? And at the sae tie, how do we ensure that our algorith is lightweight enough that coputation doesn t becoe a bottleneck? How do we place the control in the user s hands in order to control the level of interactivity needed? The other open questions involved are related to how users interpret increental visualizations. Can users understand and ake sense of the guarantees provided? Can they use these guarantees to ake well-infored decisions and terinate early without waiting for the entire visualization to be generated? Contributions. In this paper, we address all of these open questions. Our priary contribution in the paper is the notion of increentally iproving visualizations that surface iportant features as they are deterined with high confidence bringing a concept that is coonly used in other settings, e.g., rasterization and signal processing, to visualizations. Given a user specified interactivity threshold (described later), we develop increental visualizations algoriths for INCVISAGE that iniizes error. We introduce the concept of iproveent potential to help us pick the right iproveent per increent. We find, soewhat surprisingly, that a rearkably siple algorith works best under a sub-gaussian assuption [42] about the data, which is reasonable to assue in real-world datasets (as we show in this paper). We further deonstrate that these algoriths are optial in that they generate approxiations within soe error bound given the interactivity threshold. When users don t provide their desired interactivity threshold, we can pick appropriate paraeters that help the best navigate the tradeoff between error and latency. We additionally perfor user studies to evaluate the usability of an increental visualization interface, and evaluate whether users are able to ake effective decisions with increentally iproving visualizations. We found that they are able to effectively deterine when to stop the visualization and ake a decision, trading off latency and error, especially when copared to traditional online sapling schees. Prior Work. Our work is copleentary to other work in the increental visualization space. In particular, sapleaction [6] and online aggregation [9] both perfor online sapling to depict aggregate values, along with confidence-interval style estiates to depict the uncertainty in the current aggregates. In both cases, however, these approaches prevent users fro getting early insights since they need to wait for the values to stabilize. As we will discuss later, our approach can be used in tande with online aggregationbased approaches. IFOCUS [30], PFunk-H [], and ExploreSaple [50] are other approxiate visualization algoriths targeted at generating visualizations rapidly while preserving perceptual insights. IFOCUS ephasizes the preservation of pairwise ordering of bars in a bar chart, as opposed to the actual values; PFunk-H uses perceptual functions fro graphical perception research to terinate visualization generation early; ExploreSaple approxiates scatterplots, ensuring that overall distributions and outliers are preserved. An early paper by Hellerstein et al. [20] proposes a siilar technique of progressive rendering for scatterplot visualizations by using index statistics to depict estiates of density before the data records are actually fetched. Lastly, M4 [27] uses rasterization to reduce the diensionality of a tie series without ipacting the resulting visualization. None of these ethods ephasize revealing features of visualizations increentally. Outline. The outline of the reainder of this paper is as follows: in Section 2 we forally define the increental visualization proble. Section 3 outlines our increental visualization algorith while Section 4 details the syste architecture of INCVISAGE. In Section 5 we suarize the experiental results and the key takeaways. Then we present the user study design and outcoes in Section 6 (for usability) and 7 (for coparison to traditional online sapling schees). Section 8 describes other related work on data visualization and analytics. 2. PROBLEM FORMULATION In this section, we first describe the data odel, and the visualization types we focus on. We then forally define the proble. 2. Visualizations and Queries Data and Query Setting. We operate on a dataset consisting of a single large relational table R. Our approach also generalizes to ultiple tables obeying a star or a snowflake scheata but we focus on the single table case for ease of presentation. As in a traditional OLAP setting, we assue that there are diension attributes and easure attributes diension attributes are typically used as group-by attributes in aggregate queries, while easure attributes are the ones typically being aggregated. For exaple, in a product sales scenario, day of the year would be a typical diension attribute, while the sales would be a typical easure attribute. INCVISAGE supports two kinds of visualizations: trendlines and heataps. These two types of visualizations can be translated to aggregate queries Q T and Q H respectively: Q T = SELECT X a, AVG(Y) FROM R GROUP BY X a ORDER BY X a, and Q H = SELECT X a, X b, AVG(Y) FROM R GROUP BY X a, X b ORDER BY X a, X b Here, X a and X b are diension attributes in R, while Y is a easure attribute being aggregated. The trendline and heatap visualizations are depicted in Figure. For trendlines (query Q T ), the attribute X a is depicted along the x-axis while the aggregate AVG(Y) is depicted along the y-axis. On the other hand, for heataps (query Q H) the two attributes X a and X b are depicted along the x-axis and y-axis, respectively. The aggregate AVG(Y) is depicted by the color intensity for each block (i.e., each X a, X b cobination) of the heatap. Give a continuous color scale, the higher the value of AVG(Y), the higher the color intensity. The coplete set of queries (including WHERE clauses and other aggregates) that are supported by INCVISAGE can be found in Section 3.5 we focus on the siple setting for now. Note that we are iplicitly focusing on X a and X b that are ordinal, i.e., have an order associated with the so that it akes sense to visualize the as a trendline or a heatap. As we will deonstrate subsequently, this order is crucial in letting us approxiate portions of the visualization that share siilar behavior. Sapling Engine. We assue that we have a sapling engine that can efficiently retrieve rando tuples fro R corresponding to different values of the diension attribute(s) X a and/or X b (along with optional predicates fro a WHERE). Focusing on Q T for

3 now, given a certain value of X a = a i, our engine provides us a rando tuple that satisfies X a = a i. Then, by looking up the value of Y corresponding to this tuple, we can get an estiate for the average of Y for X a = a i. Our sapling engine is drawn fro Ki et al. [30] and aintains an in-eory bitap on the diension attributes, allowing us to quickly identify tuples atching arbitrary conditions [3]. Bitaps are highly copressible and effective for read-only workloads [32, 49, 48], and have been applied recently to sapling for approxiate generation of bar charts [30]. Thus, given our sapling engine, we can retrieve saples of Y given X a = a i (or X a = a i X b = b i for heat aps). We call the ultiset of values of Y corresponding to X a = a i across all tuples to be a group. This allows us to say that we are sapling fro a group, where iplicitly we ean we are sapling the corresponding tuples and retrieving the Y value. Next, we describe our proble of increentally generating visualizations ore forally. We focus on trendlines (i.e., Q T ); the corresponding definitions and techniques for heataps are described in our extended technical report [2]. 2.2 Increental Visualizations Fro this point forward, we describe the concepts in the context of our visualizations in row 2 of Figure. Segents and k-segent Approxiations. We denote the cardinality of our group-by diension attribute X a as, i.e., X a =. In Figure this value is 36. At all tie points over the course of visualization generation, we display one value of AVG(Y) corresponding to each group x i X a, i... thus, the user is always shown a coplete trendline visualization. To approxiate our trendlines, we use the notion of segents that encopass ultiple groups. We define a segent as follows: Definition. A segent S corresponds to a pair (I, η), where η is a value, while I is an interval I [, ] spanning a consecutive sequence of groups x i X a. For exaple, the segent S ([2, 4], 0.7) corresponds to the interval of x i corresponding to x 2, x 3, x 4, and has a value of 0.7. Then, a k-segent approxiation of a visualization coprises k disjoint segents that span the entire range of x i, i =.... Forally: Definition 2. A k-segent approxiation is a tuple L k = ( ) S,..., S k such that the segents S,..., S k partition the interval [, ] into k disjoint sub-intervals. In Figure, at t 2, we display a 2-segent approxiation, with segents ([, 30], 0.4) and ([3, 36], 0.8); and at t 7, we display a 7- segent approxiation, coprising ([, 3], 0.8), ([4, 4], 0.4),..., and ([35, 36], 0.7). When the nuber of segents is unspecified, we siply refer to this as a segent approxiation. Increentally Iproving Visualizations. We are now ready to describe our notion of increentally iproving visualizations. Definition 3. An increentally iproving visualization is defined to be a sequence of segent approxiations, (L,..., L ), where the ith ite L i, i > in the sequence is a i-segent approxiation, fored by selecting one of the segents in the (i )- segent approxiation L i, and dividing that segent into two. Thus, the segent approxiations that coprise an increentally iproving visualization have a very special relationship to each other: each one is a refineent of the previous, revealing one new feature of the visualization and is fored by dividing one of the segents S in the i-segent approxiation into two new segents to give an (i + )-segent approxiation: we call this process splitting a segent. The group within S L i iediately following which the split occurs is referred to as a split group. Any group in the interval I S except for the last group can be chosen as a split group. As an exaple, in Figure, the entire second row corresponds to an increentally iproving visualization, where the 2-segent approxiation is generated by taking the segent in the -segent approxiation corresponding to ([, 36], 0.5), and splitting it at group 30 to give ([, 30], 0.4) and ([3, 36], 0.8). Therefore, the split group is 30. The reason why we enforce two neighboring segent approxiations to be related in this way is to ensure that there is continuity in the way the visualization is generated, aking it a sooth user experience. If, on the other hand, each subsequent segent approxiation had no relationship to the previous one, it could be a very jarring experience for the user with the visualizations varying drastically, aking it hard for the to be confident in their decision aking. Underlying Data Model and Output Model. To characterize the perforance of an increentally iproving visualization, we need a odel for the underlying data. We represent the underlying data as a sequence of distributions D,..., D with eans µ,..., µ where, µ i = AVG(Y) for x i X a. To generate our increentally iproving visualization and its constituent segent approxiations, we draw saples fro distributions D,..., D. Our goal is to approxiate the ean values (µ,..., µ ) while taking as few saples as possible. The output of a k-segent approxiation L k can be represented alternately as a sequence of values (ν,..., ν ) such that ν i is equal to the value corresponding to the segent that encopasses x i, i.e., xi S j ν i = η j, where S j(i, η j) L k. By coparing (ν,..., ν ) with (µ,..., µ ), we can evaluate the error corresponding to a k-segent approxiation, as we describe later. Increentally Iproving Visualization Generation Algorith. Given our data odel, an increentally iproving visualization generation algorith proceeds in iterations: at the ith iteration, this algorith takes soe saples fro the distributions D,..., D, and then at the end of the iteration, it outputs the i-segent approxiation L i. We denote the nuber of saples taken in iteration i as N i. When L is output, the algorith terinates. 2.3 Characterizing Perforance There are ultiple ways to characterize the perforance of increentally iproving visualization generation algoriths. Saple Coplexity, Interactivity and Wall-Clock Tie. The first and ost straightforward way to evaluate perforance is by easuring the saples taken in each iteration k, N k, i.e., the saple coplexity. Since the tie taken to acquire the saples is proportional to the nuber of saples in our sapling engine (as shown in [30]), this is a proxy for the tie taken in each iteration. Recent work has identified 500s as a rule of thub for interactivity in exploratory visual data analysis [36], beyond which analysts end up getting frustrated, and as a result explore fewer hypotheses. To enforce this rule of thub, we can ensure that our algoriths take only as any saples per iteration as is feasible within 500s a tie budget. We also introduce a new etric called interactivity that quantifies the overall user experience: λ = k= N k ( k + ) where N k is the nuber of saples taken at iteration k and k is the nuber of iterations where N k > 0. The larger the λ, the saller the interactivity: this easure essentially captures the average waiting tie across iterations where saples are taken. We explore the easure in detail in Section 3.4 and 5.5. A ore direct way to evaluate perforance is to easure the wall-clock tie for each iteration. Error Per Iteration. Since our increentally iproving visualization algoriths trade off taking fewer saples to return results faster, it can also end up returning segent approxiations that are incorrect. We define the l 2 squared error of a k-segent approxiation L k with output sequence (ν,..., ν ) for the distributions k

4 D,..., D with eans µ,..., µ as err(l k ) = (µ i ν i) 2 () We note that there are several reasons a given k-segent approxiation ay be erroneous with respect to the underlying ean values (µ,..., µ ): () We are representing the data using k- segents as opposed to -segents. (2) We use increental refineent: each segent approxiation builds on the previous, possibly erroneous estiates. (3) The estiates for each group and each segent ay be biased due to sapling. These types of error are all unavoidable the first two reasons enable a visualization that increentally iproves over tie, while the last one occurs whenever we perfor sapling: () While a k-segent approxiation does not capture the data exactly, it provides a good approxiation preserving visual features, such as the overall ajor trends and the regions with a large value. Moreover, coputing an accurate k-segent approxiation requires fewer saples and therefore faster than an accurate -segent approxiation. (2) Increental refineent allows users to have a fluid experience, without the visualization changing copletely between neighboring approxiations. At the sae tie, is not uch worse in error than visualizations that change copletely between approxiations, as we will see later. 2.4 Proble Stateent The goal of our increentally iproving visualization generation algorith is, at each iteration k, generate a k-segent approxiation that is not just close to the best possible refineent at that point, but also takes the least nuber of saples to generate such an approxiation. Further, since the decisions ade by our algorith can vary depending on the saples retrieved, we allow the user to specify a failure probability δ, which we expect to be close to zero, such that the algorith satisfies the closeness guarantee with probability at least δ. Proble. Given a query Q T, and the paraeters δ, ɛ, design an increentally iproving visualization generation algorith that, at each iteration k, returns a k-segent approxiation while taking as few saples as possible, such that with probability δ, the error of the k-segent approxiation L k returned at iteration k does not exceed the error of the best k-segent approxiation fored by splitting a segent of L k by no ore than ɛ. 3. VISUALIZATION ALGORITHMS In this section, we gradually build up our solution to Proble. We start with the ideal case when we know the eans of all of the distributions up-front, and then work towards deriving an error guarantee for a single iteration. Then, we propose our increentally iproving visualization generation algorith ISplit assuing the sae guarantee across all iterations. We further discuss how we can tune the guarantee across iterations in Section 3.4. We consider extensions to other query classes in Section 3.5. We can derive siilar algoriths and guarantees for heataps [2]. All the proofs can be found in the technical report [2]. 3. Case : The Ideal Scenario We first consider the ideal case where the eans µ,..., µ of the distributions D,..., D are known. Then, our goal reduces to obtaining the best k segent approxiation of the distributions at iteration k, while respecting the refineent restriction. Say the increentally iproving visualization generation algorith has obtained a k-segent approxiation L k at the end of iteration k. Then, at iteration (k + ), the task is to identify a segent S i L k such that splitting S i into two new segents T and U iniizes the i= error of the corresponding (k + )-segent approxiation L k+. We describe the approach, followed by its justification. Approach. At each iteration, we split the segent S i L k into T U T and U that axiizes the quantity S i (µt µu )2. Intuitively, this quantity defined below as the iproveent potential of a refineent picks segents that are large, and within the, splits where we get roughly equal sized T and U, with large differences between µ T and µ U. Justification. The l 2 squared error of a segent S i (I i, η i), where I i = [p, q] and p q, for the distributions D p,..., D q with eans µ p,..., µ q is err(s i ) = (q p + ) q j=p (µ j η i ) 2 = S i j S i (µ j η i ) 2 Here, S i is the nuber of groups (distributions) encopassed by segent S i. When the eans of the distributions are known, err(s i) will be iniized if we represent the value of segent S i as the ean of the groups encopassed by S i, i.e., setting η i = µ Si = j S i µ j/ S i iniizes err(s i). Therefore, in the ideal scenario, the error of the segent S i is err(s i ) = S i (µ j η i ) 2 = S j S i i j S i µ 2 j µ2 S i (2) Then, using Equation, we can express the l 2 squared error of the k-segent approxiation L k as follows: err(l k ) = k (µ i ν i ) 2 S i = err(s i) i= Now, L k+ is obtained by splitting a segent S i L k into two segents T and U. Then, the error of L k+ is: err(l k+ ) = err(l k ) S i err(s i) + T U err(t ) + err(u) T U = err(l k ) S i (µ T µ U ) 2. We use the above expression to define the notion of iproveent potential. The iproveent potential of a segent S i L k is the iniization of the error of L k+ obtained by splitting S i into T and U. Thus, the iproveent potential of segent S i relative to T and U is T U (S i, T, U) = S i (µt µu )2 (3) For any segent S i = (I i, η i), every group in the interval I i except the last one can be chosen to be the split group (see Section 2.2). The split group axiizing the iproveent potential of S i, iniizes the error of L k+. The axiu iproveent potential of a segent is expressed as follows: T U (S i) = ax (S i, T, U) = ax (µt µu )2 T,U S i T,U S i S i Lastly, we denote the iproveent potential of a given L k+ by φ(l k+, S i, T, U), where φ(l k+, S i, T, U) = (S i, T, U). Therefore, the axiu iproveent potential of L k+, φ (L k+ ) = ax Si L k (S i). When the eans of the distributions are known, at iteration (k + ), the optial algorith siply selects the refineent corresponding to φ (L k+ ), which is the segent approxiation with the axiu iproveent potential. 3.2 Case 2: The Online-Sapling Scenario We now consider the case where the eans µ,..., µ are unknown. Siilar to the previous case, we want to identify a segent S i L k such that splitting S i into T and U results in the axiu iproveent potential. We will first describe our approach for a given iteration assuing saples have been taken, then we will describe our approach for selecting saples. i=

5 3.2. Selecting the Refineent Given Saples Approach. Unfortunately, since the eans are unknown, we cannot easure the exact iproveent potential, so we iniize the epirical iproveent potential based on saples seen so far. Analogous to the previous section, we siply pick the refineent that T U axiizes S i ( µt µu )2, where the µs are the epirical estiates of the eans. Justification. At iteration (k + ), we first draw saples fro the distributions D,..., D to obtain the estiated eans µ,..., µ. For each S i L k, we set its value to η i = µ Si = j S i µ j/ S i, which we call the estiated ean of S i. For any refineent L k+ of L k, we then let the estiated iproveent potential of L k+ be φ(l k+, S i, T, U) = T µ 2 T + U µ 2 U S µ 2 T U S = i S i ( µ T µ U ) 2 For siplicity we denote φ(l k+, S i, T, U) and φ(l k+, S i, T, U) as φ(l k+ ) and φ(l k+ ), respectively. Our goal is to get a guarantee for err(l k+ ) is relative to err(l k ). Instead of a guarantee on the actual error err, for which we would need to know the eans of the distributions, our guarantee is instead on a new quantity, err, which we define to be the epirical error. Given S i = (I i, η i) and Equation 2, err is defined as follows: err (S i) = S i j S i µ 2 j ηi 2. Although err (S i) is not equal to err(s i) when η i µ S, err (S i) converges to err(s i) as η i gets closer to µ S (i.e., as ore saples are taken). Siilarly, the error of k-segent approxiation err (L k ) converges to err(l k ). We show experientally (see Section 5) that optiizing for err (L k+ ) gives us a good solution of err(l k+ ) itself. To derive our guarantee on err, we need one ore piece of terinology. At iteration (k+), we define T (I, η), where I = [p, q] and p q to be a boundary segent if either p or q is a split group in L k. In other words, at the iteration (k + ), all the segents in L k and all the segents that ay appear in L k+ after splitting a segent are called boundary segents. Next, we show that if we estiate the boundary segents accurately, then we can find a split which is very close to the best possible split. Theore. If for every boundary segent T of the k-segent approxiation L k, we obtain an estiate µ T of the ean µ T that satisfies 2 µ T µ T 2 ɛ, then the refineent 6 T L k+ of L k that axiizes the estiated value φ(l k+ ) will have error that exceeds the error of the best refineent L k+ of L k by at ost err (L k+ ) err (L k+) ɛ Deterining the Saple Coplexity To achieve the guarantee for Theore, we need to retrieve a certain nuber of saples fro each of the distributions. Approach. Perhaps soewhat surprisingly, we find that we need to draw a constant C = 288 a σ 2 ɛ 2 ln ( 4 δ ) fro each distribution D i to satisfy the requireents of Theore. (We will define these paraeters subsequently.) Thus, our sapling approach is rearkably siple and is essentially unifor sapling plus, as we show in the next subsection, other approaches cannot provide significantly better guarantees. What is not siple, however, is showing that taking C saples allows us to satisfy the requireents for Theore. Another benefit of unifor sapling is that we can switch between showing our segent approxiations or showing the actual running estiates of each of the groups (as in online aggregation [9]) for the latter purpose, unifor sapling is trivially optial. Justification. To justify our choice, we assue that the data is generated fro a sub-gaussian distribution [42]. Sub-Gaussian distributions for a general class of distributions with a strong tail decay property, with its tails decaying at least as rapidly as the tails of a Gaussian distribution. This class encopasses Gaussian distributions as well as distributions where extree outliers are rare this is certainly true when the values derive fro real observations that are bounded. We validate this in our experients. Therefore, we represent the distributions D,..., D by sub-gaussian distributions with ean µ i and sub-gaussian paraeter σ. Given this generative assuption, we can deterine the nuber of saples required to obtain an estiate with a desired accuracy using Hoeffding s inequality [2]. Given Hoeffding s inequality along with the union bound, we can derive an upper bound on the nuber of saples we need to estiate the ean µ i of each x i. Lea. For a fixed δ > 0 and a k-segent approxiation of the distributions D,..., D represented by independent rando saples x,..., x with sub-gaussian paraeter σ 2 and ean 288 a σ 2 ɛ 2 µ i [0, a] if we draw C = ln ( ) 4 saples uniforly δ fro each x i, then with probability at least δ, 2 µ T µ T 2 ɛ for every boundary segent T of L 6 T k Deriving a Lower bound We can derive a lower bound for the saple coplexity of any algorith for Proble : Theore 2. Say we have a dataset D of groups with eans (µ, µ 2,..., µ ). Assue there exists an algorith A that finds a k-segent approxiation which is ɛ-close to the optial k-segent approxiation with probability 2/3. For sufficiently sall ɛ, A draws Ω( k/ɛ 2 ) saples. The theore states that any algorith that outputs a k-segent approxiation of which is ɛ-close to a dataset has to draw O( k/ɛ 2 ) saples fro the dataset. 3.3 The ISplit Algorith We now present our increentally iproving visualization generation algorith ISplit. Given the paraeters ɛ and δ, ISplit aintains the sae guarantee of error (ɛ) in generating the segent approxiations in each iteration. Theore and Lea suffice to show that the ISplit algorith is a greedy approxiator, that, at each iteration identifies a segent approxiation that is at least ɛ close to the best segent approxiation for that iteration. Data: X a, Y, δ, ɛ Start with the -segent approxiator L = (L ). 2 for k = 2,..., do 3 L k = (S,..., S k ). 4 for each S i L k do 5 for each group q S p do 6 Draw C saples uniforly. Copute ean µ q end 7 Copute µ Si = S i q S µ q i end T U 8 Find (T, U) = argax i; T,U Si S i ( µ T µ U ) 2. 9 Update L k+ = S,..., S i, T, U, S i+,..., S k. end Algorith : ISplit The paraeter ɛ is often difficult for end-users to specify. Therefore, in INCVISAGE, we allow users to instead explicitly specify an expected tie budget per iteration, B as explained in Section 2.3, the nuber of saples taken deterines the tie taken per iteration, so we can reverse engineer the nuber of saples fro B. So, given the sapling rate f of the sapling engine (see Section 2.) and B, the saple coplexity per iteration per group is siply C = B f/. Using Lea, we can copute the corresponding ɛ. Another way of setting B is to use standard rules of thub for interactivity (see Section 2.3).

6 3.4 Tuning ɛ Across Iterations So far, we have considered only the case where the algorith k= T k k, where k takes the sae nuber of saples C per group across iterations and does not reuse the saples across different iterations. If we instead reuse the saples drawn in previous iterations, the error at iteration k is ɛ a σ2 k = ln ( ) 4 kc δ, where k C is the total nuber of saples drawn so far. Therefore, the decrease in the error up to iteration k, ɛ k, fro the error up to iteration (k ), ɛ k, is (k )/k, where ɛ k = (k )/k ɛ k. This has the following effect: later iterations both take the sae aount of tie as previous ones, and produce only inor updates of the visualization. Sapling Approaches. This observation indicates that we can obtain higher interactivity with only a sall effect on the accuracy of the approxiations by considering variants of the algorith that decrease the nuber of saples across iterations instead of drawing the sae nuber of saples in each iteration. We consider three natural approaches for deterining the nuber of saples we draw at each iteration: linear decrease (i.e., reduce the nuber of saples by β at each iteration), geoetric decrease (i.e., divide the nuber of saples by α at each iteration), and all-upfront (i.e., non-interactive) sapling. To copare these different approaches to the constant (unifor) sapling approach and aongst each other, we first copute the total saple coplexity, interactivity, and error guarantees of each of the as a function of their other paraeters. Letting T k denote the total nuber of saples drawn in the first k iterations, the interactivity of a sapling approach defined in Section 2.3 can be written as: λ = is the nuber of iterations we take non-zero saples. The error guarantees we consider are, as above, the average error over all iterations. This error guarantee is err = ɛ 2 k = i= i= 288 a σ 2 2 T k ln ( 4 δ ) = 288 a σ 2 2 ln ( ) 4 δ i= We provide evidence that the estiated err and λ are siilar to err and λ on real datasets in Section We are now ready to derive the expressions for both error and interactivity for the sapling approaches entioned earlier. Expressions for Error and Interactivity. The analytical expressions of both interactivity and error for all of these four approaches are shown in Table and derived in the technical report [2]. In particular, for succinctness, we have left the forulae for Error for geoetric and linear decrease as is without siplification; on substituting T k into the expression for Error, we obtain fairly coplicated forulae with dozens of ters, including ultiple uses of the q-digaa function Ψ [25] for the geoetric decrease case, the Euler-Mascheroni constant [44] for the linear decrease case, and the Haronic nuber H for the constant sapling case: these expressions are provided in the technical report [2]. Coparing the Sapling Approaches. We first exaine the allupfront sapling approach. We find that the all-upfront sapling approach is strictly worse than constant sapling (which is a special case of linear and geoetric decrease). Theore 3. If for a setting of paraeters, a constant sapling approach and an all-upfront sapling approach have the sae interactivity, then the error of constant sapling is strictly less than the error of all-upfront sapling. Our experiental results in Section 5.5 suggests that the geoetric decrease approach with the optial choice of paraeter α has better error than not only the all-upfront approach but the linear decrease and constant sapling approaches as well. This reains an open question, but when we fix the initial saple coplexity N (which is proportional to the bound on the axiu tie taken per iteration as provided by the user), we can show that geoetric decrease with the right choice of paraeter α does have the optial T k interactivity aong the three interactive approaches. Theore 4. Given N, the interactivity of geoetric decrease with paraeter α = (N ) /( ) is strictly better than the interactivity of any linear decrease approach and constant sapling. The proof is established by first showing that for both linear and geoetric decrease, the optial interactivity is obtained for an explicit choice of decrease paraeter that depends only on the initial nuber of saples and the total nuber of iterations. Lea 2. In geoetric sapling with a fixed N, α = (N ) /( ) has the optial interactivity and it has saller error than all of the geoetric sapling approaches with α > α. Lea 3. In linear sapling with a fixed N, β = (N )/( ) has the optial interactivity and it has strictly saller error than all of the linear sapling approaches with β > β. Optial α and Knee Region. As shown in Lea 2, given N, we can copute the optial decrease paraeter α, such that any value of α > α results in higher error and lesser interactivity (higher λ). This behavior results into the eergence of a knee region in the error-interactivity curve which is confired in our experiental results (see Section 5.5). Essentially, starting fro α = any value α α has saller error than any value α > α. Therefore, for any given N there is an optial region [, α ]. For exaple, for N = 5000, 25000, 5000, and 0000, the optial range of α is [,.024], [,.028], [,.03], and [,.032], respectively. By varying α along the optial region one can either optiize for error or interactivity. For exaple, starting with α = as α α we trade accuracy for interactivity. 3.5 Extensions The increentally iproving visualization generation algorith described previously for siple queries with the AVG aggregation function can also be extended to support aggregations such as COUNT and SUM, as well as additional predicates (via a WHERE clause) details and proofs can be found in the technical report [2] The COUNT Aggregate Function Given that we know the total nuber of tuples in the database, estiating the COUNT aggregate function essentially corresponds to the proble of estiating the fraction of tuples τ i that belong to each group i. We focus on the setting below when we only use our bitap index. We note that approxiation techniques for COUNT queries have also been studied previously [8, 24, 26], for the case when no indexes are available. Using our sapling engine, we can estiate the fractions τ i by scanning the bitap index for each group. When we retrieve another saple fro group i, we can also estiate the nuber of tuples we needed to skip over to reach the tuple that belongs to group i the indices allow us to estiate this inforation directly, providing us an estiate for τ i, i.e., τ i. Theore 5. With an expected total nuber of saples C count = + γ 2 ln(2/δ)/2, the τ i, i can be estiated to within a factor γ, i.e., τ i τ i γ holds with probability at least δ. Essentially, the algorith takes as any saples fro each group until the index of the saple is γ 2 ln(2/δ)/2. We can show that overall, the expected nuber of saples is bounded above by C count. Since this nuber is sall, we don t even need to increentally estiate or visualize COUNT The SUM Aggregate Function. There are two settings we consider for the SUM aggregate function: when the group sizes (i.e., the nuber of tuples in each group) are known, and when they are unknown. Known Group Sizes. A siple variant of Algorith can also be used to approxiate SUM in this case. Let n i be the size of

7 Approach decrease paraeter T k Error Interactivity Linear Decrease β N k k (k ) β 2 A ( ) k= T N k N k 2 + β [ 6 2k 2 3k + 3k + 3 ] α Geoetric Decrease α N k α k A [ (α ) k= N (α ) α +α ] T k (α ) 2 α Constant Sapling α =, or β = 0 kn A H N N (+) 2 A All-upfront T k= = N and T k> = 0 N N 288 a σ2 Table : Expressions for error and interactivity for different sapling approaches. Here, A = 2 group i and κ = ax j n j. As in the original setting, the algorith coputes the estiate µ i of the average of the group. Then s i = n i µ i is an estiate on the su of each group i, naely s i, that is used in place of the average in the rest of the algorith. We have: Theore 6. Assue we have C i = 288a2 σ 2 n 2 i ɛ 2 κ 2 ln 4 δ saples fro group i. Then, the refineent L k+ of L k that axiizes the estiated iproveent potential φ + (L k+ ) will have error that exceeds the error of the best refineent L k+ of L k by at ost err + (L k+ ) err + (L k+) ɛκ 2. Note that here we have ɛκ 2 instead of ɛ: while at first glance this ay see like an aplification of the error, it is not so: first, the scales are different and visualizing the SUM is like visualizing AVG ultiplied by κ an error by one pixel for the forer would equal κ ties the error by one pixel for the latter but are visually equivalent; second, our error function uses a squared l 2- like distance, explaining the κ 2. Unknown Group Sizes. For this case, we siply eploy the known group size case, once the COUNT estiation is used to first approxiate the τ i with γ = ɛ/24a. We have s i = τ i µ iκ t, where κ t denotes the total nuber of ites: i= ni. Theore 7. Assue we have C i = 52 a2 σ 2 τ 2 i ɛ 2 ln 4 δ saples fro group i. Then, the refineent L k+ of L k that axiizes the estiated φ + (L k+ ) will have error that exceeds the error of the best refineent L k+ of L k by at ost err + (L k+ ) err + (L k+) ɛκ 2 t. Q T = SELECT X a, AVG(Y) FROM R GROUP BY X a ORDER BY X a WHERE Pred 4. INCVISAGE SYSTEM ARCHITECTURE The INCVISAGE client is a web-based front-end that captures user input and renders visualizations produced by the INCVISAGE back-end. The INCVISAGE back-end is coposed of three coponents: (A) a query parser, which is responsible for parsing the input query Q T or Q H (see Section 2.), (B) a view processor, which executes ISplit (see Section 3), and (C) a sapling engine which returns the saples requested by the view processor at each iteration. As discussed in Section 2., INCVISAGE uses a bitap-based sapling engine to retrieve a rando saple of records atching a set of ad-hoc conditions [30]. At the end of each iteration, the view processor sends the visualizations generated to the front-end encoded in json. To reduce the aount of json data transferred, only the changes in the visualization are counicated to the front-end. The front-end is responsible for capturing user input and rendering visualizations generated by the INCVISAGE back-end. The visualizations (i.e., the segent approxiations) generated by the back-end increentally iprove over tie, but the users can pause and replay the visualizations on deand. The details of the frontend and screenshots, along with an architecture diagra can be found the technical report [2] 5. PERFORMANCE EVALUATION ln ( ) 4 δ We evaluate our algoriths on three priary etrics: the error of the visualizations, the degree of correlation with the best algorith in choosing the split groups, and the runtie perforance. We also discuss how the choice of the initial saples N and sapling factor α ipacts the error and interactivity. 5. Experiental Setup Algoriths Tested. Each of the increentally iproving visualization generation algoriths that we evaluate perfors unifor sapling, and take either B (tie budget) and f (sapling rate of the sapling engine) as input, or ɛ (desired error threshold) and δ (the probability of correctness) as input, and in either case coputes the required N and α. The algoriths are as follows: ISplit: At each iteration k, the algorith draws N k saples uniforly fro each group, where N k is the total nuber of saples requested at iteration k and is the total nuber of groups. ISplit uses the concept of iproveent potential (see Section 3) to split an existing segent into two new segents. RSplit: At each iteration k, the algorith takes the sae saples as ISplit but then selects a segent, and a split group within that, all at rando to split. ISplit-Best: The algorith siulates the ideal case where the ean of all the groups are known upfront (see Section 3.). The visualizations generated have the lowest possible error at any given iteration (i.e. for any k-segent approxiation) aong approaches that perfor refineent of previous approxiation. DPSplit: We describe the details of this algorith in our technical report [2], but at a high level, at each iteration k, this algorith takes the sae saples as ISplit, but instead of perforing refineent, DPSplit recoputes the best possible k-segent approxiation using dynaic prograing. We include this algorith to easure the ipact of the iterative refineent constraint. DPSplit-Best: This algorith siulates the case where the eans of all the groups are known upfront, and the sae approach for producing k-segent approxiations used by DPSplit is used. The visualizations generated have the lowest possible error aong the algoriths entioned above since they have advance knowledge of the eans, and do not need to obey iterative refineent. Nae Description #Rows Size (GB) E U Sensor Intel Sensor dataset []. 2.2M 0.73 FL US Flight dataset [6]. 74M 7.2 T 20 NYC taxi trip data [4] 70M 6.3 T2 202 NYC taxi trip data [4] 66M 4.5 T3 203 NYC taxi trip data [4] 66M 4.3 WG Weather data of US [7] 45M 27 Table 2: Datasets Used for the Experients and User Studies. E = Experients and U = User Studies (Section 6 and 7). Datasets and Queries. The datasets used in the perforance evaluation experients and the user studies (Section 6 and Section 7) are shown in Table 2 and are arked by ticks ( ) to indicate where they were used. For the US flight dataset (FL) we show the results for the attribute Arrival Delay. Since all of the three years exhibit siilar results for the NYC taxi dataset, we only present the results for the year 20 (T) for the attribute Trip Tie. For the weather dataset, we show results for the attribute Teperature. The results for the other datasets are included in our technical report [2]. To verify our sub-gaussian assuption, we generated a Q-Q plot [46] for each of the selected attributes to copare the

8 distributions with Gaussian distributions. The FL, T, T2, and T3 datasets all exhibit right-skewed Gaussian distributions while WG exhibits a truncated Gaussian distribution [2]. Unless stated explicitly, we use the sae query in all the experients calculate the average of an attribute at each day of the year. Metrics. We use the following etrics to evaluate the algoriths: Error: We easure the error of the visualizations generated at each iteration k via err(l k ) (see Section 2.3). The average error across iterations is coputed as: ẽrr(l k ) = k= err(l k). Tie: We also evaluate the wall-clock execution tie. Correlation: We use Spearan s ranked correlation coefficient to easure the degree of correlation between two algoriths in choosing the order of groups as split groups. Interactivity: We use a new etric called interactivity (defined in Section 2.3) to select appropriate paraeters for ISplit. Interactivity is essentially the average waiting tie for generating the segent approxiations. We explore the easure in Section 5.5. Ipleentation Setup. We evaluate the runtie perforance of all our algoriths on a bitap-based sapling engine [3]. In addition, we ipleent a SCAN algorith which perfors a sequential scan of the entire dataset. This is the approach a traditional database syste would use. Since both ISplit and SCAN are ipleented on the sae engine, we can ake direct coparisons of execution ties between the two algoriths. All our experients are single threaded and are conducted on a HP-Z230-SFF workstation with an Intel Xeon E3-240 CPU and 6GB eory running Ubuntu 2.04 LTS. We set the disk read block size to 256KB. Unless explicitly stated we assue the tie budget B = 500s and use the paraeter values of N = 25000, α =.02 (with a geoetric decrease), and δ = All experients were averaged over 30 trials. 5.2 Coparing Execution Tie In this section, we copare the Wall Clock tie of ISplit, DPSplit and SCAN for the datasets entioned in Table 2. Suary: ISplit is several orders of agnitude faster than SCAN in revealing increental visualizations. The copletion tie of DPSplit exceeds the copletion tie of even SCAN. Moreover, when generating the early segent approxiations, DPSplit always exhibits higher latency copared to ISplit. Wall Clock Tie (sec) SCAN Copletion-DPSplit Iteration = 00-DPSplit Iteration = 50-DPSplit Iteration = 0-DPSplit Copletion-ISplit Iteration = 00-ISplit Iteration = 50-ISplit Iteration = 0-ISplit FL (70) T (70) WG (45) Datasets Figure 2: Coparison of Wall Clock Tie. We depict the wall-clock tie in Figure 2 for three different datasets for ISplit and DPSplit at iteration 0, 50, and 00, and at copletion, and for SCAN. First note that as the size of the dataset increases, the tie for SCAN drastically increases since the aount of data that needs to be scanned increases. On the other hand, the tie for copletion for ISplit stays stable, and uch saller than SCAN for all datasets: on the largest dataset, the copletion tie is alost 6 th that of SCAN. When considering earlier iterations, ISplit perfors even better, revealing the first 0, 50, and 00 segent approxiations within 5 seconds, 3 seconds, and 22 seconds, respectively, for all datasets, allowing users to get insights early by taking a sall nuber of saples. Copared to SCAN, the speed-up in revealing the first 0 features of the visualization is 2X, 22X, and 46X for the FL, T and WG datasets. When coparing ISplit to DPSplit, we first note that DPSplit is taking the sae saples as ISplit, so its differences in execution tie are priarily due to coputation tie. We see soe strange behavior in that while DPSplit is worse than ISplit for the early iterations, for copletion it is uch worse, and in fact worse than SCAN. The dynaic prograing coputation coplexity depends on the nuber of segents. Therefore, the coputation starts occupying a larger fraction of the overall wall-clock tie for the latter iterations, rendering DPSplit ipractical for latter iterations. Even for earlier iterations, DPSplit is worse than ISplit, revealing the first 0, 50, and 00 features within 7 seconds, 27 seconds, and 75 seconds, respectively, as opposed to 5, 3, and 22 for ISplit. As we will see later, this additional tie does not coe with a coensurate decrease in error, aking DPSplit uch less desirable than ISplit as an increentally iproving visualization generation algorith. 5.3 Increental Iproveent Evaluation We now copare the error for ISplit with RSplit and ISplit-Best. Suary: (a) The error of ISplit, RSplit, and ISplit-Best reduce across iterations. At each iteration, ISplit exhibits lower error in generating visualizations than RSplit. (b) ISplit exhibits higher correlation with ISplit-Best in the choice of split groups, with 0.9 for any N greater than RSplit has near-zero correlation overall. Figure 3 depicts the iterations on the x-axis and the l 2-error on the y axis of the corresponding segent approxiations for each of the algoriths for two datasets (others are siilar). For exaple, in Figure 3, at the first iteration, all three algoriths obtain the - segent approxiation with roughly siilar error; and as the nuber of iterations increase, the error continues to decrease. The error for ISplit is lower than RSplit throughout, for all datasets, justifying our choice of iproveent potential as a good etric to guide splitting criteria. ISplit-Best has lower error than the other two, this is because ISplit-Best has access to the eans for each group beforehand. We also explore the correlation of the split group order of ISplit and RSplit with ISplit-Best on varying the saple coplexity. Due to space liitations, the results can be found in our technical report [2] where we show that ISplit s split groups actually atch those fro ISplit-Best, even if the error is a bit higher. err(lk) ISplit RSplit ISplit-Best Iteration a. FL err(lk) ISplit RSplit ISplit-Best Iteration b. T Figure 3: l 2 squared error of ISplit, Rsplit, and ISplit-Best. 5.4 Releasing the Refineent Restriction We now copare ISplit with DPSplit and DPSplit-Best we ai to evaluate the ipact of the refineent restriction, and whether it leads to substantially lower error. Due to space liitations we again present the experients in our technical report [2]. Suary: Given the sae set of paraeters, DPSplit and ISplit have roughly siilar error; as expected DPSplit-Best has uch lower error than both ISplit and DPSplit. Thus, in order to reduce the drastic variation of the segent approxiations, while not having a significant ipact on error, ISplit is a better choice than DPSplit. 5.5 Optiizing for Error and Interactivity So far, we have kept the sapling paraeters fixed across algoriths; here we study the ipact of these paraeters. Specifically,

I ve Seen Enough : Incrementally Improving Visualizations to Support Rapid Decision Making

I ve Seen Enough : Incrementally Improving Visualizations to Support Rapid Decision Maing Sajjadur Rahman Maryam Aliabarpour Ha Kyung Kong Eric Blais 3 Karrie Karahalios,4 Aditya Parameswaran Ronitt Rubinfield