Generating example tuples for Data-Flow programs in Apache Flink


1 Generating example tuples for Data-Flow programs in Apache Flink Master Thesis by Amit Pawar Submitted to Faculty IV, Electrical Engineering and Computer Science Database Systems and Information Management Group in partial fulfillment of the requirements for the degree of Master of Science in Computer Science as part of the ERASMUS MUNDUS programme IT4BI at the Technische Universität Berlin July 31, 2015 Thesis Advisors: Johannes Kirschnick Thesis Supervisor: Prof. Dr. Volker Markl

2 i Eidesstattliche Erklärung Ich erkläre an Eides statt, dass ich die vorliegende Arbeit selbstständig verfasst, andere als die angegebenen Quellen/Hilfsmittel nicht benutzt, und die den benutzten Quellen wörtlich und inhaltlich entnommenen Stellen als solche kenntlich gemacht habe. Statutory Declaration I declare that I have authored this thesis independently, that I have not used other than the declared sources/resources, and that I have explicitly marked all material which has been quoted either literally or by content from the used sources. Berlin, July 31, 2015 Amit Pawar

3 GENERATING EXAMPLE TUPLES FOR DATA-FLOW PROGRAMS IN APACHE FLINK by Amit Pawar Database Systems and Information Management Group Electrical Engineering and Computer Science Masters in Information Technology for Business Intelligence Abstract Dataflow programming is a programming paradigm in which computational logic is modeled as a directed graph from the input data sources to the output sink. The intermediate nodes between sources and sink act as processing units that define what action is to be performed on the incoming data. Due to its inherent support for concurrency, dataflow programming is a natural choice for many data-intensive parallel processing systems and is used extensively in the current big-data market. Among the wide range of parallel processing platforms available, Apache Hadoop (with its MapReduce framework), Apache Pig (which runs on top of MapReduce in the Hadoop ecosystem), and Apache Flink and Apache Spark (with their own runtimes and optimizers) are some of the systems that leverage the dataflow programming style. Dataflow programs can handle terabytes of data and perform efficiently, but data at such scale introduces difficulties such as understanding the complete dataflow (what is the output of the dataflow or of any intermediate node), debugging (it is impractical to track large-scale data throughout the program using breakpoints or watches) and visual representation (it is quite difficult to display terabytes of data flowing through the tree of nodes). This thesis aims to address these limitations for dataflow programs on the Apache Flink platform using the concept of Example Generation, a technique to generate sample example tuples after each intermediate operation from source to sink. This allows the user to view and validate the behavior of the underlying operators and thus the overall dataflow. We implement the example generator algorithm for a defined set of operators and evaluate the quality of the generated examples. For ease of visual representation of the dataflow, we integrate this implementation with the Interactive Scala Shell available in Apache Flink.

4 Acknowledgements I would like to express my deep sense of gratitude to my advisor Johannes Kirschnick for his excellent guidance, scientific advice and constant encouragement throughout this thesis work. His apt assistance helped me understand and tackle many challenges during the research, implementation and writing of this thesis. I would like to thank Prof. Dr. Volker Markl and Dr. Ralf-Detlef Kutsche for their kind coordination efforts. I would also like to thank Nikolaas Steenbergen, Stephan Ewen and all the members of the Flink dev user group for their kind advice and clarifications on understanding the platform. I express my sincere thanks to all the staff and professors from the IT4BI programme for their help and support during my Masters degree. Finally, I would like to thank all my friends from the IT4BI programme, generations I and II, for their constant support and cherished memories over the past two years.

5 Contents
Abstract
Acknowledgements
List of Figures
List of Tables
1 Introduction
  1.1 Motivation
  1.2 Goal
  1.3 Outline
2 Dataflow Programming
  2.1 Apache Hadoop
  2.2 Apache Pig
  2.3 Apache Flink
  2.4 Comparison of Big-Data Analytics Systems
  2.5 Limitations of Dataflow Programming
3 Example Generation
  3.1 Definition
  3.2 Dataflow Concepts
  3.3 Example Generation Properties
  3.4 Example Generation Techniques
    3.4.1 Downstream Propagation
    3.4.2 Upstream Propagation
    3.4.3 Example Generator Algorithm
  3.5 Example Generation In Apache Flink
4 Implementation
  4.1 Operator Tree
    4.1.1 Single Operator
    4.1.2 Constructing the Operator Tree
  4.2 Supported Operators
  4.3 Equivalence Class Model
  4.4 Algorithm
    Overview
    Downstream Pass
    Upstream Pass
    Pruning Pass
  Flink Add-ons
    Using Semantic Properties
    Interactive Scala Shell
5 Evaluation
  Performance Metrics
    Completeness
    Conciseness
    Realism
  Experiments
    Setup and Datasets
    Quality of Generated Examples
    Running Time
6 Related Work
7 Conclusion
  Discussion and Future Work
A Appendix
  A.1 Word-Count Example in different Dataflow Systems
    A.1.1 Apache Pig
    A.1.2 Apache Hadoop
    A.1.3 Apache Flink
  A.2 Sample Scala Script for Flink's Interactive Scala Shell
  A.3 Experiment's Dataflow Programs
  A.4 Flink-Illustrator Source Code
Bibliography

7 List of Figures
3.1 Sample Dataflow that returns highly populated countries
Sample Dataflow with output examples
Dataflow Concepts
Ideal set of output for sample dataflow
Incomplete set of examples
Single Operator Class Diagram
Tree Construction Approaches
Sample Dataflow to retrieve users visiting high page rank websites
Downstream pass on Sample Dataflow
Empty Equivalence class as a result of only Downstream pass
Upstream Pass: Synthetic records introduction
Upstream Pass: Synthetic records converted to Real records
Pruning Pass: Start
Pruning Pass: After pruning sink operator
Pruning Pass: Complete pruning (Example 1)
Pruning Pass: Complete pruning (Example 2)
Forwarded Fields Annotations
Completeness of Filter operator
Realism of Filter operator
Evaluation of Programs 1 to 6 and
Evaluation of Programs 7 to
Evaluation of Programs 10 to
Flink Illustrator
Runtime comparison

8 List of Tables
2.1 Comparison of Big-Data systems in context of Dataflow Programming
Description of SingleOperator properties
Locally created DataSet used for evaluation purpose
Experiment Dataset's description and sizes

9 Chapter 1 Introduction Dataflow programming has gained popularity with the booming big-data market over the past decade. It is a data processing paradigm in which data movement is the focal point, in contrast to traditional programming (object-oriented, imperative or procedural), where passing control from one object to another forms the basis of the respective programming model. Distributed data processing frameworks leverage dataflow programming by splitting and distributing a large dataset across different computing nodes, where each node performs the dataflow operations on its local data. A single dataflow program can be seen as a directed graph from the source nodes to the sink nodes, where source nodes represent the input datasets consumed and sink nodes represent the output datasets generated by executing the dataflow program. All the intermediate nodes, between source and sink, act as processing units that define what action is to be performed on the incoming data. These actions can be categorized as either: i. general relational algebra operators (e.g., join, cross, project, distinct, etc.) or, ii. user-defined function operators (e.g., flatmap, map, reduce, etc.). Apache Hadoop [1] with its MapReduce [2] framework uses the dataflow programming style. Apache Flink [3] is a distributed streaming dataflow engine that falls into a newer category of big-data frameworks, an alternative to Hadoop's MapReduce, along with Apache Spark [4]. Other dataflow programming systems include Apache Pig [5], Aurora [6], Dryad [7], River [8], Tioga [9] and CIEL [10]. This thesis is based on dataflow programs from Apache Flink, where we generate sample examples after each intermediate node (operator) from source to sink, allowing the Flink user to:

10 Chapter 1. Introduction 2 1. View and validate the behavior of the underlying set of operators and thus understand and learn the complete dataflow 2. Optimize the dataflow by determining the correct set of operators to achieve the final output 3. Monitor iterations in case of an iterative KDDM (Knowledge Discovery and Data Mining) algorithm 4. Understand the behavior of user-defined function (UDF) operators (otherwise seen as a blackbox) 1.1 Motivation Dataflow programming, similar to any other programming paradigm, is an iterative/incremental process. In order to program a final correct version of the dataflow, the user may be subjected to more than one iteration. Each iteration consists of steps such as coding a dataflow, building and executing it on the respective system, and finally analyzing the output or the error log to determine whether the dataflow resulted in the expected outcome; if not, the user revises the steps (normally via debugging) and repeats with a new iteration. Any programming paradigm dealing with a large-scale dataset incurs longer execution/testing times, and the same applies to dataflow programming. Hence, the iterative model of development when handling large-scale data is a time-consuming process and, therefore, inefficient. The whole process can be made efficient if the user can verify the underlying execution of the dataflow, i.e., verify the execution at each node (operator) in the dataflow. This can be done by checking the dataset that is being consumed and generated at any given operator, which in turn allows the user to pin-point and rectify the logic or error (if any). Visualizing the dataflow with an example dataset after each operator execution allows the user to test the necessary assumptions made in the program; in a way, it avoids the need for debugging using breakpoints and watches. The overall process of example generation after each operator permits the user to learn about the logic of the operators as well as the complete dataflow program. A dataflow program is a tree of operators from source to sink, where one or more sources (leaves) converge into a single sink (root) via different intermediate nodes.

11 Chapter 1. Introduction 3 In order to get optimal performance from the program, choosing the appropriate operator type is of utmost importance. For example, the Join operator in Apache Flink can be executed in multiple ways, such as Repartition or Broadcast, depending on the input size and order. With these options, the user can select the best operator to optimize the overall performance. Similarly, the user can decide to replace Join, or any costly operator, with a transformation or aggregation or other cheaper operators as long as the final objective is accomplished. Thus dataflow programming can be seen as a plug-n-play like model, where the user can plug/unplug (add/remove) operators and then play (execute) the complete dataflow in order to decide on the most optimal set of operators for the final version of the program. In such a scenario, having a concise set of examples (instead of large datasets) that completely illustrates the dataflow would be of great help to the user, as it avoids the repetitive, cost- and time-heavy execution of the program after each plugging or unplugging. Dataflow programming frameworks such as Flink and Spark are well suited for the implementation of machine-learning algorithms for Knowledge Discovery and Data Mining (KDDM). KDDM in itself is an iterative process, where a model (classification, clustering, etc.) is repetitively trained for better prediction accuracy; e.g., the k-means clustering algorithm, an iterative refinement process, is used in many data mining and related applications. In the k-means algorithm, given the input of k clusters and n observations, the observations are iteratively allotted to the most appropriate cluster based on the computation of the cluster mean. This problem is considered to be NP-hard and hence training such predictive models can be a cumbersome process, where constant fine-tuning (via re-sampling or dataset feature changes or mathematical re-computation) is needed. The overall modeling process in a dataflow environment can be complicated [11] and is pictured as a blackbox, because the user has little insight into how the newly tuned dataflow might behave and has to wait throughout the process execution; only after that can the user verify the output and take the necessary action. Having sample examples that demonstrate the dataflow quickly would be of great help to the user, as it opens up the blackbox and provides a quick, efficient way of fine-tuning the dataflow and, in turn, the predictive model. Out of the several operators available in dataflow programming, UDF operators pose a problem of readability/understandability to a new user (one who is new to the system or code), as it might be difficult to guess what is the purpose of the

12 Chapter 1. Introduction 4 respective UDF. It can be a grouping function or a transformation function or a partition function; in order to interpret a UDF operator, one needs to investigate at the code level, which can be a tedious task for a new user. Instead, if the same user has access to the input consumed and the output generated by the UDF operator, it will be easier to reason about the behavior of that very operator. The idea of presenting examples after each operator has been realized and is being used for testing and diagnostic purposes in Apache Pig. Apache Pig, a high-level dataflow language and execution framework for parallel computation, runs on top of the MapReduce framework in the Hadoop ecosystem. Pig features the Pig Latin language layer that simplifies MapReduce programming via its set of operators. One of its diagnostic operators is ILLUSTRATE, which displays sample examples after each statement in a step-by-step execution of a sequence of statements (a dataflow program), where each statement represents an operator. It allows the user to review how data is transformed through a sequence of Pig Latin statements (forming a dataflow). This feature is included in the testing and diagnostics package of Pig, as it allows the user to test dataflows on small datasets and get faster turnaround times. In this thesis, we enhance the understanding of large-scale dataflow program execution by introducing a new feature in Apache Flink which helps the user by providing sample examples at the operator level, similar to the ILLUSTRATE function in Apache Pig. This thesis work, like Apache Pig, is based on an example generator [12]. The algorithm used in the example generator works by retrieving a small sample of the input data and then propagating this data through the dataflow, i.e., through the operator tree. However, some operators, such as JOIN and FILTER, can eliminate these sample examples from the dataflow. Consider, for example, an operator performing a join on two input datasets A(x,y) and B(x,z) on the common attribute x (a join key): if both A and B contain many distinct values for x, initial sampling at A and B may lead to unmatched x values. Hence, the join may not produce an output example due to the absence of a common join key in both samples. Similarly, a filter operator executed on a sample dataset might produce an empty result if no input example satisfies the filtering predicate. To address such issues, the algorithm used in [12] generates synthetic example data, which allows the user to examine the complete semantics of the given dataflow program.
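To make the join case above concrete, the following minimal sketch in Flink's Java DataSet API joins two independently drawn samples of A(x,y) and B(x,z) whose x values happen not to overlap, so the join produces no output example. The values and field layout are purely illustrative assumptions, not data or code from the thesis:

    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.tuple.Tuple2;

    public class SampledJoinDemo {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            // Independently sampled records from A(x, y) and B(x, z): no common x value.
            DataSet<Tuple2<Integer, String>> sampleA =
                    env.fromElements(new Tuple2<>(1, "a1"), new Tuple2<>(2, "a2"));
            DataSet<Tuple2<Integer, String>> sampleB =
                    env.fromElements(new Tuple2<>(7, "b7"), new Tuple2<>(9, "b9"));

            // Join on the first field (the join key x). With these samples the result is empty,
            // so the JOIN operator cannot be illustrated from the initial samples alone.
            DataSet<Tuple2<Tuple2<Integer, String>, Tuple2<Integer, String>>> joined =
                    sampleA.join(sampleB).where(0).equalTo(0);

            System.out.println("Joined examples: " + joined.count());   // prints 0
        }
    }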

13 Chapter 1. Introduction 1.2 Goal This thesis concentrates on dataflow programs from Apache Flink. Our main goal is to implement an ILLUSTRATE-like example generator feature of Pig in Flink, such that it helps users tackle the problems faced with respect to dataflow programming in a big-data environment. As this thesis is inclined towards the implementation of a new feature in Flink, we try to answer the following questions: 1. What are the challenges in implementing the example generator, and what are the ways to tackle them? 2. How can the concept of an example generator be realized in Flink? What inputs are required for the implementation and how can they be obtained from Flink? 3. What extra features of Flink can be exploited in order to differentiate it from other implementations? 4. How well does the implemented algorithm perform with respect to the metrics defined? 1.3 Outline The outline lays out the approach that we followed for the thesis, which subsequently helped us find the answers to the questions mentioned in the previous section. A comprehensive study on dataflow programming and the systems that take advantage of this paradigm (Chapter 2) A broad introduction to the example generation problem, its challenges and the approach towards the solution (Chapter 3) Implementation of the example generator algorithm in Flink, with an extensive description of the input construction, the various algorithm steps and the Flink features used (Chapter 4)

14 Chapter 1. Introduction 6 Experimenting and evaluating the performance of the implemented algorithm against the defined metrics by executing the implemented code on different dataflow programs and diverse datasets, covering the operators as well as the stages of the algorithm (Chapter 5) A brief overview of the related works in the field of example generation (Chapter 6) We conclude by discussing the results and findings (Chapter 7)

15 Chapter 2 Dataflow Programming Dataflow programming introduces a new programming paradigm that internally represents applications as a directed graph, similar to a dataflow diagram [13]. A program is represented as a set of nodes (also called blocks) with input and/or output ports. These nodes act as source, sink or processing blocks for the data flowing through the system. Nodes are connected by directed edges that define the flow of data between them. It is reminiscent of the Pipes and Filters software architectural model, where Filters are the processing nodes and Pipes serve as the passages for the data streams between the filters. One of the main reasons why dataflow programming surged with the rise of big data is its default support for concurrency, which allows increased parallelism. In a dataflow program, each node is internally an independent processing block, i.e., each individual node can process on its own once it has its respective input data. This kind of execution allows data streaming: an intermediate node in the dataflow starts functioning as soon as the data arrives from the previous node and transfers its output to the next node. Hence, this avoids the need to wait for the previous node to finish its complete execution. In a big-data environment, the aforementioned characteristics, parallelism and streaming, allow dataflow programs to be run on a distributed system with computer clusters. Such a system splits the large dataset into smaller chunks and distributes them across cluster nodes; each node then executes the dataflow on its local chunk. The sub-results produced after execution at all the nodes are combined to get the final result for the large dataset. This is the main gist of how

16 Chapter 2. Dataflow Programming 8 a dataflow program is executed on a big-data system with a distributed processing environment. In this chapter, we discuss different big-data analytics systems that take advantage of the dataflow programming paradigm. 2.1 Apache Hadoop Apache Hadoop [1] is a big-data framework that allows distributed processing of large-scale datasets across clusters of computers using simple programming models. Hadoop leverages dataflow programming via its MapReduce module. Hadoop MapReduce is a software framework for writing applications that process vast amounts of data (multi-terabyte datasets) in parallel on large clusters in a reliable, fault-tolerant manner. A typical MapReduce program consists of a chain of mappers (with a Map method) and reducers (with a Reduce method), in that order. These Map and Reduce methods are second-order functions that consume an input dataset (either the whole or a part of a large dataset) and a user-defined function (UDF) that is applied to the corresponding input data. In a MapReduce program, the mappers and reducers form the processing nodes of the dataflow, with sources and sinks (of respective formats) explicitly mentioned. Despite being able to scale to very large datasets, previous studies have reported limitations of Hadoop MapReduce; the issues that relate to dataflow programming are listed as follows: Lack of support for dedicated relational algebra operators such as Join, Cross, Union and Filter [14, 15]. These operators are frequently used in many iterative algorithms, e.g., Join is an important operator in a PageRank algorithm. This restricts the user to custom-coding the logic for handling the above-mentioned common operations, and the programmer needs to think in terms of map and reduce to implement them. Lack of inherent support for iterative programming [16], an integral feature of any dataflow programming framework as well as useful for many data analytics algorithms, e.g., the k-means algorithm, an iterative clustering process for finding k clusters. The workaround for iteration in MapReduce is to have

17 Chapter 2. Dataflow Programming 9 an external program that repeatedly invokes mappers and reducers. But this comes with the added overhead of increased IO latency and serialization issues, with no explicit support for specifying a termination condition. The above limitation regarding relational operators is addressed by Hadoop through another module, Apache Pig. 2.2 Apache Pig Apache Pig is a high-level dataflow language and execution framework for parallel computation in Apache Hadoop. Apache Pig [5] was initially developed at Yahoo to allow Hadoop users to focus more on analyzing datasets rather than investing time in writing complex code using the map and reduce operators. Internally, Pig is an abstraction over MapReduce, i.e., all Pig scripts are converted into Map and Reduce tasks by its compiler. Thus, in a way, Pig makes programming MapReduce applications easier. The language for the platform is a simple scripting language called Pig Latin [17], which abstracts from the Java MapReduce idiom into a form similar to SQL. Pig Latin allows the user to write a dataflow that describes how the data will be transformed (via aggregation or join or sort) as well as to develop their own functions (UDFs) for reading, processing and writing data. A typical Pig program consists of the following steps: 1. LOAD the data for manipulation. 2. Run the data through a set of transformations. These transformations can be either relational algebra transformations (join, cross, filter, etc.) or user-defined functions. (All the transformations are internally translated to Map and Reduce tasks by the compiler.) 3. DUMP (display) the data to the screen or STORE the results in a file. When relating a Pig program to a dataflow, LOAD operators are the source nodes, the set of transformations forms the processing nodes, and DUMP or STORE are the sink nodes.

18 Chapter 2. Dataflow Programming 10 Pig addresses the limitations of MapReduce by providing a suite of relational operators for the ease of data manipulation [5]. However, in order to achieve iteration as well as other control-flow structures (if-else, for, while, etc.), one needs to use Embedded Pig, where Pig Latin statements and Pig commands are nested into scripting languages such as Python, JavaScript or Groovy. This reduces the simplicity of programming (one of Pig's major selling points), as it introduces a JDBC-like compile, bind, run model that adds the extra overhead of complex invocations. As Pig translates all the statements and commands from the scripts into MapReduce tasks, it is considered to be slower than well-written/implemented MapReduce code. Pig does, however, offer better scope for optimization than MapReduce, and an optimized Pig script can perform on par with MapReduce code. 2.3 Apache Flink Apache Flink (formerly known as Stratosphere) [3] is a data processing system and an alternative to Hadoop's MapReduce module. Unlike Pig, which runs on top of MapReduce, Flink comes with its own runtime, a distributed streaming dataflow engine that provides data distribution and communication for distributed computations over data streams. It features powerful programming abstractions in multiple languages, Java and Scala, thus providing the user different language options to program a dataflow. Flink also supports automatic program optimization, allowing the user to focus more on other data handling issues. It has native support for iterative programming via iteration operators such as bulk iterations and incremental iterations. It also provides support for programs consisting of large directed acyclic graphs (DAGs) of operations. One of the essential components in the Flink framework is the Flink Optimizer [18], which provides automatic optimization for a given Flink job as well as offers techniques to minimize the amount of data shuffling, thus formulating an optimized data processing pipeline. The Flink Optimizer is based on the PACT [19] (Parallelization Contract) programming model, which extends the concepts of MapReduce but is also applicable to more complex operations. This allows Flink to extend its support to relational operators such as Join, Cross, Union, etc. The output of the Flink optimizer is a compiled and optimized PACT program, which is nothing but

19 Chapter 2. Dataflow Programming 11 a DAG-based dataflow program. This is how dataflow programming is achieved in Apache Flink. A typical Flink program consists of the following basic steps: 1. Load/create the initial data. 2. Specify transformations of this data. 3. Specify where to put the results of the computations. 4. Trigger the program execution. In Step 1 we define the source nodes of the dataflow program, the transformations of Step 2 form the processing nodes, and in Step 3 we define the sink nodes. Flink's runtime natively supports iterative programming [20], the feature lacking in MapReduce and not easily accessible in Pig, through its different types of iteration operators: Bulk and Delta. These operators encapsulate a part of the program and execute it repeatedly, feeding the result of one iteration back into the next iteration. They also allow the user to explicitly declare the termination criterion. Such an iterative processing system makes the Flink framework extremely fast for data-intensive and iterative jobs compared to Hadoop's MapReduce and Apache Pig. 2.4 Comparison of Big-Data Analytics Systems The implementation of a sample dataflow program, Word-Count (which reads text from files or a string variable and counts how often words occur), is presented in the Appendix for each framework (Hadoop MapReduce A.1.2, Pig A.1.1, and Flink A.1.3), in order to observe the respective programming techniques and to get an idea about the effort needed to code the same program; a minimal sketch is also given below. As shown in Table 2.1, Apache Flink has a clear advantage over the other two systems, with its faster processing times and native support for relational operators as well as iterative programming. This makes Flink a compelling choice when implementing a dataflow program for a large-scale dataset.
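As a minimal sketch of the four basic steps above, a Word-Count job in Flink's Java DataSet API could look roughly as follows. This is an illustrative sketch, not the appendix listing of the thesis; the in-line strings stand in for a real file source:

    import org.apache.flink.api.common.functions.FlatMapFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.util.Collector;

    public class WordCount {
        public static void main(String[] args) throws Exception {
            // Step 1: load/create the initial data (source node).
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
            DataSet<String> text = env.fromElements("to be or not to be", "that is the question");

            // Step 2: specify transformations (processing nodes): tokenize, group by word, sum counts.
            DataSet<Tuple2<String, Integer>> counts = text
                    .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                        @Override
                        public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                            for (String word : line.toLowerCase().split("\\W+")) {
                                if (!word.isEmpty()) {
                                    out.collect(new Tuple2<>(word, 1));
                                }
                            }
                        }
                    })
                    .groupBy(0)
                    .sum(1);

            // Step 3: specify where to put the results (sink node).
            counts.print();

            // Step 4: trigger the program execution. In recent DataSet API versions print()
            // itself triggers execution; with a file sink one would call env.execute("WordCount").
        }
    }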

20 Chapter 2. Dataflow Programming 12
Criterion | Apache Hadoop (MR) | Apache Pig | Apache Flink
Framework | MapReduce | MapReduce | Flink optimizer and Flink runtime
Language Supported | Java, C++, etc. | Pig-Latin | Java, Scala, Python (beta)
Dataflow Processing Nodes | Mappers and Reducers | Suite of operators, UDFs | Suite of Operators
Ease of Programming | Simple but tricky when implementing join or similar operators | Simple | Simple
Relational Operators | Not natively supported, implemented via MR | Natively supported | Natively supported
Iterative Programming | Not natively supported | Supported via Embedded Pig | Natively supported
Processing Time | Fast | Slow | Fastest
Table 2.1: Comparison of Big-Data systems in the context of Dataflow Programming
2.5 Limitations of Dataflow Programming As discussed in [13], the major limitations of the dataflow programming paradigm are visual representation and debugging. In a big-data environment, with large-scale data flowing, these limitations are indeed more severe [21]. It is quite impractical to visually represent terabytes of data flowing through the tree of operators, and to denote what action is being performed on that data at each operator. For tracking an error in a program, the user often introduces breakpoints and watches to monitor the flow of execution along with changes in variable values or output data. This approach of monitoring is, however, futile when it comes to data of such vast size. This thesis work tries to address these limitations in the Apache Flink environment. For a Flink dataflow program, we generate a concise set of example data at each node in the operator tree, such that it allows the user to validate the behavior of each operator as well as the complete dataflow. It eliminates the need for debugging (via breakpoints, watches, etc.) to an extent, as the user can diagnose an error (after seeing the flow of sample examples in the dataflow) by locating the problematic operator in the tree and rectifying the logic at that very operator. Visual representation of a dataflow is made available in Flink via its Web-Client interface, or Flink job submission interface, though the flow of data is not part of this representation. This thesis work integrates with Flink's Interactive Scala Shell, a new feature in Flink, to display the set of examples for the dataflow

21 Chapter 2. Dataflow Programming 13 program executed via the interactive shell, thus making the generated examples visually accessible to the users. This way we address the limitations of the dataflow programming paradigm in a big-data system, Apache Flink. In the next chapter, we define the example generation problem and explain the different approaches to generating a set of concise examples that allows the user to reason about the complete dataflow.

22 Chapter 3 Example Generation In this chapter, we describe the problem of generating example records for dataflow programs and mention the already existing approaches for the same, followed by the challenges faced and their drawbacks. We also describe the theoretical terms and concepts used throughout this thesis, with a brief introduction to the algorithm that addresses the drawbacks of the existing approaches. 3.1 Definition A dataflow program is a directed graph G = (V, E), where V is the set of nodes denoting operators (source, data transformation/processing node, sink) and E is the set of directed edges denoting the flow of data from one operator to the next. Example generation is the process of producing a set of concise examples after each operator in the dataflow, such that it allows a user to understand the complete semantics of the dataflow program. Let us demonstrate this concept using an input dataflow example that returns a list of highly populated countries. Figure 3.1: Dataflow that returns highly populated countries

23 Chapter 3. Example Generation 15 Figure 3.1 is a dataflow that LOADs two datasets, Countries (ISO Code, Name) and Cities (Name, Country, Population in millions), and performs a JOIN on the attribute Name/Country (Countries/Cities). The joined result is then grouped by country and an aggregation (SUM) is performed on the population. Later, we filter out the countries with population less than or equal to 4 million and present the remaining list as output. Example generation, when applied to the dataflow from Figure 3.1, gives us the output shown in Figure 3.2, where we have sample examples after each operator to facilitate the user's understanding of the operators as well as of the complete dataflow. Figure 3.2: Sample output of Example Generation on the dataflow from Figure 3.1 This allows the user to verify the behavior of each operator just by looking at the output sample, e.g., verifying whether the aggregation is having the intended effect or the filter is behaving correctly. Instead of cross-checking the whole logic, the user will now be able to target only the faulty operators. Figure 3.1 and Figure 3.2 are, respectively, the input and the output of any example generation algorithm. Let us now briefly explain, with respect to the dataflow example from Figure 3.1, the terms used throughout this thesis; first, for concreteness, a Flink sketch of this dataflow is given below.
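The sketch below expresses the dataflow of Figure 3.1 in Flink's Java DataSet API. It is illustrative only: the file names, field positions and the use of groupBy/sum are assumptions made for this sketch, not the thesis's code (grouping and aggregation operators are in fact outside the operator set supported by the illustrator implementation of Chapter 4):

    import org.apache.flink.api.common.functions.FilterFunction;
    import org.apache.flink.api.common.functions.JoinFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.api.java.tuple.Tuple3;

    public class HighlyPopulatedCountries {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            // LOAD: Countries(ISO Code, Name) and Cities(Name, Country, Population in millions).
            DataSet<Tuple2<String, String>> countries =
                    env.readCsvFile("countries.csv").types(String.class, String.class);
            DataSet<Tuple3<String, String, Double>> cities =
                    env.readCsvFile("cities.csv").types(String.class, String.class, Double.class);

            // JOIN on Countries.Name == Cities.Country, keeping (country name, city population).
            DataSet<Tuple2<String, Double>> joined = countries
                    .join(cities).where(1).equalTo(1)
                    .with(new JoinFunction<Tuple2<String, String>, Tuple3<String, String, Double>,
                            Tuple2<String, Double>>() {
                        @Override
                        public Tuple2<String, Double> join(Tuple2<String, String> country,
                                                           Tuple3<String, String, Double> city) {
                            return new Tuple2<>(country.f1, city.f2);
                        }
                    });

            // GROUP by country, SUM the population, then FILTER countries above 4 million.
            DataSet<Tuple2<String, Double>> result = joined
                    .groupBy(0)
                    .sum(1)
                    .filter(new FilterFunction<Tuple2<String, Double>>() {
                        @Override
                        public boolean filter(Tuple2<String, Double> t) {
                            return t.f1 > 4.0;
                        }
                    });

            result.print();   // sink
        }
    }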

24 Chapter 3. Example Generation 3.2 Dataflow Concepts Source: The Load operators are the sources in the above dataflow, as they read the data from the input files/tables/collections. Sink: The operator that produces (by storing in a file or displaying) the final result is the sink. Filter is the sink in the above dataflow. Downstream pass: Starting from the sources we move in the direction of the sink, i.e., from the Loads to the Filter. Upstream pass: Starting from the sink we move in the direction of the sources, i.e., from the Filter to the Loads. Downstream operator/neighbor: If Operator1 consumes the output of Operator2, then Operator1 is the downstream neighbor of Operator2. Join is the downstream operator of both Loads; similarly, Group is the downstream neighbor of Join and Filter is that of Group. Upstream operator/neighbor: If Operator1 consumes the output of Operator2, then Operator2 is the upstream neighbor of Operator1. Group is the upstream neighbor of Filter; similarly, Join is that of Group and the Loads are those of Join. Operator Tree: The complete chain of operators from the sources to the sink makes up the operator tree. The operator tree is the representation of the given dataflow program in terms of the operators used. All these concepts are visualized in Figure 3.3. Figure 3.3: Concepts related to a dataflow program

25 Chapter 3. Example Generation 3.3 Example Generation Properties Let us now briefly explain the properties that define a good example generation technique. Completeness: The examples generated at each operator in the dataflow should collectively be able to illustrate the semantics of that operator. For example, in the dataflow output of Figure 3.2, the examples before and after the Filter operator clearly explain the semantics of filtering by population, as examples with low population are not propagated to the output. The same can be said of the Group operator: the examples illustrate the meaning of grouping by country and aggregating (via sum) the population. Completeness is the most important property, as it allows the user to verify the behavior of each and every operator in the dataflow. Conciseness: The set of examples generated after each operator should be as small as possible, such that it minimizes the effort of examining the data in order to verify the behavior of that operator. The output presented in Figure 3.2 cannot be considered concise, as there is scope to explain the behavior of the operators with a smaller set of examples, as shown in Figure 3.4. Realism: An example is considered to be real if it is present in the respective source file/table/collection. Thus, the set of examples generated by any operator must be a subset of the examples in the source in order to be considered a real set. In the dataflow output shown in Figure 3.2, assume that all the examples generated by the Load operators are from the respective source files; this means all the examples generated in the output of Figure 3.2 are real. In Figure 3.4, we have altered the output of the dataflow (Figure 3.2) to make it more concise and thus ideal with respect to the properties defined.

26 Chapter 3. Example Generation 18 Figure 3.4: Ideal set of examples satisfying all properties 3.4 Example Generation Techniques In this section, we discuss the techniques/approaches that can be considered for example generation, the problems related to each approach, and how we can overcome them. 3.4.1 Downstream Propagation The simplest way to generate examples at each operator in the dataflow is to sample a few examples from the source files and push these examples through the operator tree, executing each operator and recording the results after each execution. The problem with this approach is that the sampled examples do not always allow all operators in the operator tree to achieve total completeness. For example, considering our dataflow example from Figure 3.1, initial sampling might produce examples that cannot be joined, thus resulting in a lack of completeness (as shown in Figure 3.5). This can indeed be averted if we increase the sampling size and push as many examples as possible through the operator tree [22], but this violates the property of conciseness. As we seek an approach that generates a complete and concise set of examples, relying only on downstream propagation cannot be the choice for our implementation; a sketch of this naive approach follows.
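The following illustration of downstream-only propagation uses Flink's Java DataSet API, with DataSet.first(n) standing in for the random sampler and collect() materializing the intermediate output after each operator. It is only a sketch of the naive idea, not the algorithm implemented in this thesis (which adds upstream and pruning passes); the file name and threshold are placeholders:

    import java.util.List;
    import org.apache.flink.api.common.functions.FilterFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.tuple.Tuple3;

    public class DownstreamOnlyIllustration {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            // Source: Cities(Name, Country, Population in millions); the path is a placeholder.
            DataSet<Tuple3<String, String, Double>> cities =
                    env.readCsvFile("cities.csv").types(String.class, String.class, Double.class);

            // Naive downstream pass: take a small sample and push it through the operators,
            // materializing the intermediate result after each one.
            DataSet<Tuple3<String, String, Double>> sample = cities.first(3);
            List<Tuple3<String, String, Double>> afterLoad = sample.collect();

            DataSet<Tuple3<String, String, Double>> filtered = sample.filter(
                    new FilterFunction<Tuple3<String, String, Double>>() {
                        @Override
                        public boolean filter(Tuple3<String, String, Double> city) {
                            return city.f2 > 4.0;
                        }
                    });
            List<Tuple3<String, String, Double>> afterFilter = filtered.collect();

            System.out.println("After LOAD:   " + afterLoad);
            System.out.println("After FILTER: " + afterFilter);
            // If no sampled city exceeds 4.0, afterFilter is empty and the FILTER operator is not
            // illustrated; this is exactly the incompleteness that motivates the other passes.
        }
    }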

27 Chapter 3. Example Generation 19 Figure 3.5: Incompleteness is often the case in downstream propagation 3.4.2 Upstream Propagation The second approach that can be considered for example generation is to move from sink to source and generate the output examples based on the characteristics of each operator (joined examples, crossed examples, or filtered examples). Based on these outputs, we generate the inputs and propagate upstream recursively until we reach the sources. This approach works fine if we are aware of the operator behavior (join, union, cross, etc.) but fails when a UDF operator is part of the operator tree. A UDF being a blackbox, it is complex to predict its behavior beforehand, so this approach fails to generate examples, resulting in incompleteness. The only way to generate examples for these UDF operators is to push data through them in the downstream direction. Hence, neither downstream nor upstream propagation individually is sufficient to generate an ideal set of examples satisfying both completeness and conciseness. Nonetheless, a combination of both downstream and upstream propagation, with pruning of redundant examples, can lead to a better set of results. 3.4.3 Example Generator Algorithm The algorithm with the steps (in that order): 1. Downstream Pass, 2. Pruning, 3. Upstream Pass, 4. Pruning

28 Chapter 3. Example Generation 20 was proposed in [12] and was observed to be more efficient (at generating a complete and concise set of examples) than the downstream-only and upstream-only approaches. To ensure the completeness of all the operators in an operator tree, [12] introduces the Equivalence Class Model. Equivalence Class Model: For a given operator O, a set of equivalence classes ε_O = {E1, E2, ..., Em} is defined such that each class denotes one aspect of the operator semantics. For example, a FILTER operator's equivalence set is denoted by ε_filter = {E1, E2}, where E1 denotes the set of examples passing the filtering predicate, and E2 denotes the set of examples failing the filtering predicate. In Section 4.3 we define equivalence classes for the set of supported operators. The following is the pseudo-code of the algorithm [12] used in this thesis for example generation.
Algorithm 1 Example Generator
Step 1: use the downstream pass to generate possible sample examples at each operator in the operator tree
Step 2: using the equivalence class model, check for operator completeness, and for any incompleteness identified, generate sample examples using the upstream pass
Step 3: use the pruning pass to prune/remove the redundant examples from the operators
Let us now formally introduce the example generation problem with respect to Apache Flink that we are addressing in this thesis. 3.5 Example Generation In Apache Flink As discussed in Section 1.2, we implement the ILLUSTRATE feature available in Apache Pig on the completely distinct computing platform of Apache Flink. This work is based on the paper [12], where they describe and prove the best possible

29 Chapter 3. Example Generation 21 way to generate a set of complete, concise and real examples for a given dataflow program. In the implementation, we have considered a few notable differences between Pig and Flink, and thereby defined our requirements: 1. Flink is a multi-language framework (with support for Java, Scala and Python), in contrast to Pig, which uses a textual scripting language called Pig Latin. In our implementation, we focus on Flink's Java jobs. 2. Pig allows ILLUSTRATE either on a complete dataflow script or after any given dataset (also known as a relation) from the script. We have adapted this to support only a complete Flink job, i.e., a dataflow from source to sink. We are of the opinion that illustrating the complete dataflow is more practical than illustrating after every dataset, because eventually all used datasets are displayed in a complete job illustration. 3. Pig's input is transformed into a relation; a relation is a collection of tuples and is similar to a table in a relational database. In Flink the input can be of any format (txt, csv, hdfs, collection), although for our implementation we need to convert these inputs into tuple format before performing any transformation on the input, as our intention is to find example tuples. These are the differences we observed and adapted accordingly in our implementation. To view the overall implementation picture, let us now briefly explain the basic life-cycle when the Flink illustrator, the module that generates sample examples, is invoked for a Flink job: 1. For a submitted job, an operator tree is created. 2. This operator tree forms the main input for our algorithm; the different passes then act on this input to generate complete, real and concise examples. 3. The generated examples are then displayed to the user (via the console or the Scala shell); a hypothetical sketch of this life-cycle is given at the end of this chapter. After this overview of the example generation concepts and the introduction to the algorithm as well as the equivalence class model, the next chapter provides a detailed explanation of the implementation with respect to Apache Flink: the process of

30 Chapter 3. Example Generation 22 generating the input (Operator Tree), followed by the list of supported Flink operators and their respective Equivalence classes, and each algorithm pass. Later, we mention the Flink features that have been integrated with the implementation, making it different and novel.
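The three-step life-cycle above can be pictured as a small driver routine. The sketch below is hypothetical Java: the names IllustratorLifecycle, OperatorNode, buildOperatorTree, runPasses and display are placeholders invented for illustration and are neither the thesis's classes nor Flink API:

    import java.util.List;
    import java.util.Map;

    public class IllustratorLifecycle {

        interface OperatorNode { }   // hypothetical stand-in for the thesis's SingleOperator

        static void illustrate(Object submittedFlinkJob) {
            // 1. For the submitted job, an operator tree is created.
            List<OperatorNode> operatorTree = buildOperatorTree(submittedFlinkJob);

            // 2. The operator tree is the main input for the algorithm; the different passes
            //    act on it to generate complete, real and concise examples per operator.
            Map<OperatorNode, List<Object>> examples = runPasses(operatorTree);

            // 3. The generated examples are displayed to the user (console or Scala shell).
            display(examples);
        }

        static List<OperatorNode> buildOperatorTree(Object job) { return null; }                   // see Section 4.1
        static Map<OperatorNode, List<Object>> runPasses(List<OperatorNode> tree) { return null; } // see Section 4.4
        static void display(Map<OperatorNode, List<Object>> examples) { }                          // display step
    }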

31 Chapter 4 Implementation Having defined what example generation is and how it is useful in Apache Flink, this chapter gives an in-depth account of how this concept is realized in Flink. The topic is best treated under the following headings: 1. Operator Tree 2. Supported Operators 3. Equivalence Class 4. Algorithm 5. Flink add-ons 4.1 Operator Tree The operator tree is a chain or list of single operators that starts at the input sources and ends at the output sink, with one or more data processing/transforming operators between the two. The operator tree is created once the submitted Flink job invokes the illustrator module. Before going into further detail on how we build an operator tree for a given job, let us define the SingleOperator object, which is the granular element of an operator tree.

32 Chapter 4. Implementation 4.1.1 Single Operator The SingleOperator object defines the different properties of the operator under consideration. The idea behind formalizing our own operator object rather than using the one available from Flink 1 was twofold: i. We were not going to use all the properties of the Flink operator class, as they were not adding any value to our implementation, e.g., the degree of parallelism and compiler hints. ii. A few things needed specifically for our implementation were missing from the available Flink operator class, e.g., having the join keys easily available for the given input, having the output of each operator available for further manipulation, and having details that verify that the operator produced an output for the set of inputs. Our SingleOperator class along with its properties can be viewed in Figure 4.1.
SingleOperator
  equivalenceclasses : List<EquivalenceClass>
  operator : Operator<?>
  parentoperators : List<SingleOperator>
  operatoroutputaslist : List<Object>
  syntheticrecord : Tuple
  semanticproperties : SemanticProperties
  JUCCondition : JUCCondition
  operatorname : String
  operatortype : OperatorType
  operatoroutputtype : TypeInformation<?>
JUCCondition
  firstinput : int
  secondinput : int
  firstinputkeycolumns : int[ ]
  secondinputkeycolumns : int[ ]
  operatortype : OperatorType
Figure 4.1: The Single Operator Class
1 flink/api/java/operators/operator.html

33 Chapter 4. Implementation 25
equivalenceclasses : To define the completeness of the output set
operator : Flink operator object
parentoperators : Set of input operators whose output is being consumed by the current operator
operatoroutputaslist : Output examples produced by the current operator (stored as a List collection)
syntheticrecord : Synthetic record constructed to ensure operator completeness
semanticproperties : Defines the transformations on tuple fields in the UDF operators
JUCCondition : An inner class that facilitates manipulation of details related to the join, union and cross operators, such as providing the input number as well as the assigned key columns
operatorname : Name of the current operator
operatortype : The type of the operator (JOIN, LOAD, CROSS, etc.)
operatoroutputtype : Defines the structure and datatypes of the output produced
Table 4.1: Description of SingleOperator properties
The properties of a given SingleOperator object are explained in Table 4.1. 4.1.2 Constructing the Operator Tree After defining what our custom operator class consists of, let us now look at two different ways to construct the list of these custom operators that constitutes our main input, the operator tree. For better understanding, let us consider a sample dataflow program, as shown in Figure 4.2, that takes two input Load operators, performs a Join on them and Projects some result as output. In this scenario, Load1 and Load2 form our input sources and Project is our output sink. Figure 4.2: Tree Construction Approaches 1. Top-Down Push Approach: This is the approach used in our final implementation. In this method, we start from the source operator and then move downstream towards the sink output

34 Chapter 4. Implementation 26 operator. If we consider the dataflow program in Figure 4.2, we get two source operators, Load1 and Load2, in the order they are defined in the program. Suppose Load1 is defined first: we construct Load1's SingleOperator object and add it to the operator tree, and we then seek the downstream operator of Load1 and thus get the Join operator. As Join is a dual-input operator (similar to Cross and Union), we make sure both of its parents are added to the tree before adding the Join operator itself. This allows us better handling of the outputs and the execution of a dual-input operator. For example, if we add and execute in the order Load1, Join, Project, Load2, then Join will have input from only one parent operator, Load1, and the execution will halt. Hence, we make sure Load2's SingleOperator object is added to the tree before adding Join's. Using the above technique we continue moving downstream until the sink is reached, and thus we obtain the complete operator tree for the given sample dataflow program. This method can be seen in the following pseudo-code (Algorithm 2):
Algorithm 2 Top-Down Push Method
Step 1: let List<SingleOperator> be an operatortree
Step 2: get all the input sources for a given dataflow program
Step 3: get the first input from the input sources
Step 4: construct a SingleOperator object for the received input and add it to the operatortree
Step 5: get the downstream operator of the added input and check whether it is a single-input operator or a dual-input operator; if single-input goto Step 6.1, if dual-input goto Step 6.2
Step 6.1: construct a SingleOperator object for this downstream operator and add it to the operatortree
Step 6.2: check if both parents are present in the operatortree; if present, construct a SingleOperator object for this downstream operator and add it to the operatortree; if not, get the next input from the input sources and goto Step 4
Step 7: continue from Step 5 till we have reached the sink, i.e., the added operator has no downstream operator
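The following compact Java sketch mirrors Algorithm 2. The SingleOperator interface here is a hypothetical stand-in that models only what the algorithm needs (downstream neighbor, parents, single- vs dual-input); it is neither the thesis's actual class nor Flink API:

    import java.util.ArrayList;
    import java.util.List;

    public class TopDownPush {

        // Hypothetical minimal view of the thesis's SingleOperator (illustrative only).
        interface SingleOperator {
            SingleOperator getDownstream();      // downstream neighbor, null at the sink
            List<SingleOperator> getParents();   // upstream neighbors
            boolean isDualInput();               // Join / Union / Cross
        }

        // Algorithm 2: build the operator tree by pushing from the sources toward the sink.
        static List<SingleOperator> buildOperatorTree(List<SingleOperator> sources) {
            List<SingleOperator> operatorTree = new ArrayList<>();                  // Step 1

            for (SingleOperator source : sources) {                                 // Steps 2-4
                if (!operatorTree.contains(source)) {
                    operatorTree.add(source);
                }
                SingleOperator current = source.getDownstream();                    // Step 5
                while (current != null) {
                    // Step 6.2: a dual-input operator is added only after both of its parents.
                    if (current.isDualInput() && !operatorTree.containsAll(current.getParents())) {
                        break;   // resume from the next source; the operator is added later
                    }
                    if (!operatorTree.contains(current)) {                          // Step 6.1
                        operatorTree.add(current);
                    }
                    current = current.getDownstream();                              // Step 7
                }
            }
            return operatorTree;   // operators in source-to-sink order
        }
    }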

35 Chapter 4. Implementation 2. Bottom-Up Volcano Approach: This is an alternative technique to create an operator tree. As the name suggests, this approach constructs the operator tree starting from the sink, i.e., the output operator, and then moves upstream towards the input source operators. Once a Flink job for the above sample dataflow program (Figure 4.2) is submitted, we seek the sink operator; in this case, the sink operator is Project. We construct the SingleOperator object for the Project operator by setting the necessary details and then add it to the operator tree. We then seek the input/parent operators of Project, and get the Join operator. We continue in a similar fashion until there is no input for an operator, which means we have reached a source operator, resulting in the completion of the operator tree for the given sample dataflow program. This whole process can be seen in the following pseudo-code (Algorithm 3):
Algorithm 3 Bottom-Up Volcano Method
Step 1: let List<SingleOperator> be an operatortree
Step 2: get the sink for a given dataflow program
Step 3: construct a SingleOperator object for the sink and add it to the operatortree
Step 4: get the upstream operators for the object added to the operatortree
Step 5: construct a SingleOperator object for every upstream operator and add it to the operatortree
Step 6: continue from Steps 4 to 5 till we have reached the sources, i.e., we don't find any upstream operator for the added operator
4.2 Supported Operators After seeing the techniques to construct an operator tree, we define the types of operators that are allowed and supported in our implementation and also look into the different conditions under which an operator may not produce any output. LOAD: This is the first operator in the operator tree and acts as a source for the operators downstream. It will always produce records unless the input

36 Chapter 4. Implementation 28 source file or collection is empty. Load in Flink can also be achieved via a FlatMap user-defined function (UDF) operator, which allows us to convert the input into the requested tuple format.
DISTINCT: This operator de-duplicates the records present in the upstream operator. Unless the upstream operator generates an empty set of tuples, Distinct will always produce output records.
JOIN: A dual-input operator that combines the records from the upstream parent operators on a common join key. If there is no matching field on the join key in the parent operators' outputs, Join will produce an empty output. Join in Flink behaves differently compared to a traditional SQL Join: in Flink it joins the tuples from both inputs to form a single tuple of size 2 (tuple from input 1, tuple from input 2), whereas in SQL the tuples are merged into one tuple. Figure 3.2 shows the behavior of an SQL Join, and Figure 4.4 depicts the Flink Join behavior.
UNION: A dual-input operator that combines the outputs of the upstream parent operators. Here the output type of both parent operators must be the same, i.e., they must produce the same type of tuple fields. Unless both parent operators generate empty result sets, Union will always produce output records.
CROSS: Also a dual-input operator; it produces the cross product of the records from the upstream parent operators, i.e., it builds all pairwise combinations of the records from both parent operators' outputs. Cross will produce an empty output if either upstream operator's result set is empty.
FILTER: A user-defined function (UDF) operator in Flink, which takes a filtering predicate and applies it to the upstream operator's result set. If none of the upstream operator's records passes the predicate, Filter's result will be an empty set.
PROJECT: An operator that projects the mentioned fields from the tuples. Unless the upstream operator's record set is empty, Project will always have an output result set.
In our implementation, we are limited to this set of operators out of the plethora of Flink operators (which include many other UDF operators), because having the

37 Chapter 4. Implementation 29 above set of operators already gives us the overall idea of how example generation works in Flink. As of now, Filter, FlatMap and Map are the only Flink UDF operators included in this implementation, although, after defining fitting semantics, it is possible to extend the implementation to include the other, not-yet-supported operators. The list of Flink operators not supported in this implementation is: GroupReduce, GroupCombine, CoGroup, Aggregation, Iteration, Partition. The reason behind the exclusion of the above operators is that one needs to define exact semantics for each operator, and with most of these operators being UDFs (where the functionality is defined at the code level), it is difficult to characterize them. We were able to include the FlatMap and Map UDF operators in this implementation because structural transformation of the input tuple is the only semantic aspect that has to be taken into account, in contrast to grouping or iterations. In order to maintain completeness, i.e., to avoid the situation where an operator generates an empty set, and to cover all the necessary semantics of an operator, we adapt the model stated in the paper [12] known as the Equivalence Class Model, which is described in the next section. 4.3 Equivalence Class Model In this model, we assign to each operator from the operator tree a list of equivalence classes, such that each equivalence class depicts a particular, unique aspect of the operator semantics. Let us formulate the above concept: for a given operator O, ε_O = {E1, E2, ..., Em} is the set of equivalence classes, and each Ei specifies and illustrates one aspect of the operator semantics. When the sample examples are generated for an operator, either as input or output, we make sure that each generated example belongs to an equivalence class. We must also cover all the equivalence classes for the given operator by having at least one example for each equivalence class in the result set. This way we can achieve semantic completeness for the operator under consideration. Let us now define the equivalence classes for the operators that we have considered:

38 Chapter 4. Implementation 30 LOAD: Every record from the source file or collection that is being loaded forms the single equivalence class E1. This can also be stated as: every record that Load generates is assigned to a single equivalence class E1.
DISTINCT: Every output record is assigned to a single equivalence class E1. The behavior can best be described when there exists at least one duplicate record in the input set, but this may not always be possible as it depends on the output generated by its upstream operator.
JOIN: Every output record is assigned to a single equivalence class E1. This illustrates that there exists a case of a matching join key in the parent operators' output sets, thereby clarifying the semantics.
UNION: Every input record from the first parent operator is assigned to E1, and every input record from the second parent operator is assigned to E2. This allows us to confirm that Union produces tuples having at least one record from each parent operator.
CROSS: Similar to Union, every input record from the first parent operator is assigned to E1, and every input from the second parent operator is assigned to E2. This way we confirm that both parents have an input, for Cross to produce any output.
FILTER: Every input record that passes the filtering predicate is assigned to E1, whereas the one that fails is assigned to E2. This way we have at least one record that passes the filter so that Filter produces an output, and we also depict the type of examples that are blocked by the filtering predicate.
PROJECT: Every output record is assigned to E1; this verifies that Project had appropriate input to produce an output.
4.4 Algorithm In this section, we describe our implementation algorithm for generating the example tuples for each operator in the operator tree. Our algorithm needs two main inputs: 1. an operator tree for a submitted Flink job (which is created using the technique seen above in Section 4.1.2), and 2. the base tables, i.e., the input source files or the collections, depending on the input format used in the dataflow program. The

The algorithm is generic with respect to input formats and supports the different formats used by Flink, such as CSV files, text files, Java collections, or HDFS files. Every input record is converted to a tuple structure, as tuples form the basis for data movement through the operator tree. With respect to operator support, we are limited to the set of operators mentioned above, for the reasons discussed in Section 4.2. Let us quickly give an overview of the different steps our algorithm performs in the next section.

Overview

After receiving the inputs, i.e., the operator tree of a given Flink dataflow program and the input sources, we perform in total three different passes in order to generate the final concise and complete set of examples. The algorithm can be summarized as follows (a high-level sketch of the three passes is shown after this list):

1. Downstream pass: In this pass we move from the source operators to the sink, considering one operator at a time. At the source operators, we randomly sample n records from the input sources and propagate these records to the downstream operator. The downstream operator then acts on the records to generate an intermediate output, which is passed on to the next downstream operator. We continue in this fashion until we reach the sink operator. After reaching the sink, we set the equivalence classes for each operator in the tree using the examples generated. Based on the initial sampled records, a downstream operator may or may not generate an example. To tackle this incompleteness, i.e., the case of an empty equivalence class, we have the next pass.

2. Upstream pass: In this pass we move from the sink to the source operators and check at each operator whether any incompleteness exists, by identifying empty equivalence classes. In case of incompleteness, we create a synthetic record that resolves the incompleteness for the operator under consideration. We then propagate this synthetic record up to the source operator and convert it into a concrete real example. We continue in this fashion until the source is reached.

3. Pruning pass: After resolving the incompleteness issue, we tackle conciseness in this pass. There may exist records that are redundant, in the sense that they do not add any extra semantic value for the operator under consideration. We eliminate such records in this pass, along with their trail throughout the tree.
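The overall shape of the generator can be pictured with the following minimal Java sketch. The Operator interface and all names are hypothetical stand-ins, not the actual implementation or Flink's API; the sketch only illustrates how the three passes are chained.

import java.util.List;

// Hypothetical sketch of the overall generator: the three passes chained together.
class ExampleGenerator {

    // Minimal stand-in for a node of the operator tree (assumed type, not Flink's API).
    interface Operator {
        String name();
    }

    private final List<Operator> operatorTree;  // ordered from the sources to the sink
    private final int sampleSize;               // n, the number of records sampled per LOAD

    ExampleGenerator(List<Operator> operatorTree, int sampleSize) {
        this.operatorTree = operatorTree;
        this.sampleSize = sampleSize;
    }

    void generateExamples() {
        downstreamPass();  // source -> sink: sample, evaluate, track lineage, set equivalence classes
        upstreamPass();    // sink -> source: fill empty equivalence classes with synthetic records
        pruningPass();     // sink -> source: drop records that add no semantic value
    }

    private void downstreamPass() { /* Algorithm 4 */ }
    private void upstreamPass()   { /* Algorithm 5 */ }
    private void pruningPass()    { /* Algorithm 6 */ }
}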

In the next sections, we explain each pass in detail. For better understanding, let us consider an example dataflow program and use its operator tree and sources as the inputs to our algorithm. Figure 4.3 shows a dataflow program that provides the list of users who tend to visit URLs with a high page rank.

Figure 4.3: Sample example - Dataflow program that retrieves users visiting pages with high page rank

In the above running example, we load two datasets, UserVisits and PageRanks, from their respective source files. Both datasets are then joined on the field url, and the joined records are filtered so that only records with a rank greater than three are accepted. Finally, the filtered records are projected into a tuple of the format <user, url, rank>, thereby stating the high-ranked URLs for each user. Once the flink-illustrator module is invoked, the operator tree is created; for our running example the tree will be the same as in Figure 4.3, i.e., {Load UserVisits, Load PageRanks, Join, Filter, Project}, as per our Top-Down Push approach (Algorithm 2). As the input is ready, our algorithm invokes the downstream pass.

Downstream Pass

In the downstream pass, we randomly select n records from each input source, which form the result sets of the Load operators. These records are then propagated to the downstream operator, which acts on the received input records and generates an output, which in turn acts as input to the next downstream operator, and so on until we reach the sink. Consider our operator tree as a list of operators {O_1, O_2, O_3, ..., O_m}. Given this order, O_1's output will be the input of O_2; operator O_2 then evaluates the received set of inputs and possibly generates an output, and this output in turn becomes the input of O_3.

This way we continue until we reach O_m, which by definition is the sink operator, thereby completing the downstream pass. We store all the intermediate results produced by each operator. Let us apply this concept to the running example.

Figure 4.4: Downstream pass applied to our running example from Figure 4.3

After the downstream pass, the examples generated for each operator can be viewed as in Figure 4.4. As stated, the Loads will have randomly selected inputs from the source files. The Join operator receives these records as inputs and evaluates the join on url, the joining key; there are two common URLs present, namely youtube and facebook, which allows Join to produce outputs by joining the respective tuples from the parent operators. The next downstream operator, Filter, accepts only those records with a rank greater than three, while discarding the rest. Finally, Project, the sink operator, creates tuples with the requested structure.

Record's Lineage

During the downstream pass, we track the flow of each individual record for its lifetime in the operator tree; this tracking of an individual record is known as the Lineage of that record. For example, the lineage of the record <Amit, facebook, 5pm>, moving in the direction from source to sink, is constructed as follows: the left-hand side denotes the operator traversal in the operator tree (and the effect each operator has on the tuple's structure), whereas the right-hand side denotes the record's traversal through the operator tree and shows the effect of each operator's evaluation on it.

Operator traversal (tuple structure)                         Record's traversal
Load: <Username, URL, Time>                                  Load: <Amit, facebook, 5pm>
Join UserVisits with PageRanks on URL:
      <<Username, URL, Time>, <URL, Rank>>                   Join: <<Amit, facebook, 5pm>, <facebook, 4>>
Filter Rank > 3:
      <<Username, URL, Time>, <URL, Rank>>                   Filter: <<Amit, facebook, 5pm>, <facebook, 4>>
Project Username, URL, Rank:
      <Username, URL, Rank>                                  Project: <Amit, facebook, 4>

Hence the lineage of the record <Amit, facebook, 5pm> is:

<Amit, facebook, 5pm>
<<Amit, facebook, 5pm>, <facebook, 4>>
<<Amit, facebook, 5pm>, <facebook, 4>>
<Amit, facebook, 4>

Similarly, the lineage of the record <youtube, 2> is:

<youtube, 2>
<<Amit, youtube, 3pm>, <youtube, 2>>

Having these lineages in turn lets us form the lineage groups, which are the technique used in our pruning pass.

Lineage Group

The lineage group of a given record adds some extra information to the available lineage, namely the operator that acts on the record and the output it generates. For example, the lineage group of <Amit, facebook, 5pm> is as follows:

<Amit, facebook, 5pm>
{Load, <Amit, facebook, 5pm>}
{Join, <<Amit, facebook, 5pm>, <facebook, 4>>}
{Filter, <<Amit, facebook, 5pm>, <facebook, 4>>}
{Project, <Amit, facebook, 4>}

These lineage groups are then used in the pruning pass. After the completion of the downstream pass, we examine the generated example set at each operator and set the equivalence classes for the respective operator.
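A lineage group can thus be pictured as a mapping from each operator to the record(s) it produced for a given source record. The following minimal Java sketch is only an illustration of such a structure; the names are hypothetical and do not reflect the actual implementation.

import java.util.*;

// Hypothetical sketch: the trail of one source record through the operator tree.
class LineageGroup {

    private final Object sourceRecord;
    // operator name -> records derived from sourceRecord at that operator
    private final Map<String, List<Object>> trail = new LinkedHashMap<>();

    LineageGroup(Object sourceRecord) {
        this.sourceRecord = sourceRecord;
    }

    // Called during the downstream pass whenever an operator emits a record derived from sourceRecord.
    void addEntry(String operatorName, Object derivedRecord) {
        trail.computeIfAbsent(operatorName, k -> new ArrayList<>()).add(derivedRecord);
    }

    Object sourceRecord() {
        return sourceRecord;
    }

    // Walked from sink to source by the pruning pass.
    Map<String, List<Object>> entries() {
        return Collections.unmodifiableMap(trail);
    }
}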

The complete downstream pass can be summarized in the following pseudo code (Algorithm 4):

Algorithm 4 Downstream Pass
Step 1: let operatorTree be the input
Step 2: set the LOAD operators' records by randomly selecting n values from the respective sources, and store these records as the output of the LOAD operators
Step 3: initiate a lineage group for every record from the above outputs as <record, <Load, record>>
Step 4: get the next downstream operator from the operator tree, evaluate the operator on the available input, store the intermediate output, and add <record, <operator, record's evaluation at this operator>> to the record's respective lineage group
Step 5: continue with Step 4 until the sink is reached
Step 6: set the equivalence classes for each operator based on the output generated

In the above running dataflow example, each operator generated an output and every equivalence class of every operator was filled. For example, Join had a joined example, thereby filling E_1, and Filter had both equivalence classes filled, E_1 with two records passing and E_2 with one record failing the filtering predicate. This may not always be the case; for instance, consider the situation shown in Figure 4.5 for the same dataflow program with a slight modification, where we remove the Filter operator for simplicity. This type of scenario might occur in any example dataflow, and to tackle it we move on to the upstream pass.

Upstream Pass

In this pass, we move from the sink towards the source, inspecting one operator at a time. At each operator, we identify possible cases of incompleteness and address them by introducing synthetic records that boost the completeness of the operator under consideration. We identify incompleteness based on the equivalence classes set in the downstream pass: if, for a given operator O, any equivalence class is empty, we introduce a synthetic record into O's input set such that, after evaluation, it makes that equivalence class non-empty. After introducing the synthetic record, we propagate it upstream until the respective source is reached, where we then convert this synthetic record into an actual record by matching it as closely as possible with data available at the source.

Figure 4.5: Non-matching join-key attributes lead to an empty result at the Join operator

In the following section, we explain how we generate a synthetic record for an operator's equivalence class. A synthetic record is a concise representation of the tuple that an operator consumes or produces with respect to the equivalence class we are trying to fill. Based on the operator under consideration, we explicitly define whether a field in the tuple has a certain significance; for example, in the case of a Join operator we state whether a field is the join key or not. For the set of operators considered in our implementation, our language for defining synthetic records is very simple: i. if the field is a join key, we add the marker Join key to the respective field position in the tuple; ii. all remaining fields are marked as Don't care (with a hint of the field's datatype), meaning that any value is acceptable in this position. This could certainly be extended to support a richer set of operators, for example by stating, in the case of the Filter operator, which field in the tuple the filtering predicate evaluates, or, in the case of the FlatMap operator, which field is transformed. But as these operators are UDF operators in Flink, retrieving such information is not yet possible. Indeed, the semantic properties of an operator are useful in this scenario, but they cannot be generalized to the overall implementation, as they are an optional property of an operator and may not always be set.
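For illustration only, this two-marker language could be represented roughly as in the following Java sketch; the types and helper names are hypothetical placeholders, not the actual implementation.

// Hypothetical sketch: the two-marker language used to describe synthetic records.
class SyntheticField {

    enum Marker { JOIN_KEY, DONT_CARE }

    final Marker marker;
    final Class<?> typeHint;  // datatype hint carried along with a "Don't care" field

    private SyntheticField(Marker marker, Class<?> typeHint) {
        this.marker = marker;
        this.typeHint = typeHint;
    }

    static SyntheticField joinKey(Class<?> type)  { return new SyntheticField(Marker.JOIN_KEY, type); }
    static SyntheticField dontCare(Class<?> type) { return new SyntheticField(Marker.DONT_CARE, type); }
}

// A synthetic record is simply an ordered list of marked fields.
class SyntheticRecord {
    final SyntheticField[] fields;

    SyntheticRecord(SyntheticField... fields) {
        this.fields = fields;
    }
}

With such a representation, the synthetic UserVisits input used in the example below would read as new SyntheticRecord(SyntheticField.dontCare(String.class), SyntheticField.joinKey(String.class), SyntheticField.dontCare(String.class)).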

Now let us explain how this works with respect to the example in Figure 4.5. As we can see, there is no common join key in the two Loads, which results in Join producing an empty result, and hence the equivalence class E_1 is empty. As we traverse from the sink to the source, we first come across the Project operator; since Project merely displays its result in the requested structure, we know that having some records in its input will solve its incompleteness, and we therefore ignore it, because if Join produces a result, Project certainly will too. Thus, we move upstream, reach the Join operator, and find E_1, i.e., the class of joined examples, to be empty. Hence we attempt to introduce records into Join's input that resolve the incompleteness. Since url is the join key, our synthetic records are as follows:

Load (UserVisits): <Don't care, Join key, Don't care>
Load (PageRanks): <Join key, Don't care>

Having the above records as input to the Join operator resolves its incompleteness, as we get the resulting joined record <<Don't care, Join key, Don't care>, <Join key, Don't care>>, and progressively Project's incompleteness as well, as shown in Figure 4.6. We then move these synthetic records upstream to the sources and convert them into records with real, matching data values taken from the respective source files; these new data values are obtained from examples that were not used in the downstream pass. As shown in Figure 4.7, <John, linkedin, 6pm> and <linkedin, 4>, unused in the downstream pass, are the records introduced to boost completeness.

Figure 4.6: Upstream Pass: Synthetic records introduction and assumed operator evaluation

Figure 4.7: Upstream Pass: Synthetic records to real records

This way our upstream pass attempts to resolve any incompleteness present; the whole process can be seen as a single iteration of introducing synthetic records, converting them to real records, and evaluating the operators once again. For Join, Cross or Union operators we can fill an empty equivalence class within a single iteration, but for the Filter operator one iteration may not always suffice, because finding a specific record that either satisfies or fails the filtering predicate (depending on the equivalence class to fill) is a time-exhausting process on big datasets. This is also due to the fact that we are not exposed to the filtering predicate, it being a UDF in Flink, so manipulating the synthetic record to our benefit is not possible. Having said that, we use the same method for Filter as for the other operators to generate the best possible record in a single iteration.

Given the input from the downstream pass, the upstream pass can be viewed as the following pseudo code (Algorithm 5), where we move from sink to source examining one operator at a time.

Algorithm 5 Upstream Pass
Step 1: let O be the operator under consideration (initially O is the sink operator)
Step 2: check for empty equivalence classes in O; if any equivalence class is empty go to Step 3.1, if all equivalence classes are filled get the next upstream operator and go to Step 1
Step 3.1: introduce synthetic records into O's input set such that they make the equivalence class non-empty, and propagate these records in the upstream direction until the source is reached
Step 3.2: at the source, convert each synthetic record into a closely matching real record from the source's input
Step 4: evaluate the operators again, store the newly added records, update the respective lineage groups, and discard the synthetic records
Step 5: set the equivalence classes for the operators based on the newly available records
Step 6: get the next upstream operator (if any) and continue from Step 1
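The control flow of Algorithm 5 can be illustrated with the following compact Java sketch; the data structures and helper methods are hypothetical placeholders intended only to show the shape of the pass, not the actual implementation.

import java.util.*;

// Hypothetical sketch of the upstream pass: walk from the sink towards the sources,
// detect empty equivalence classes and inject synthetic records to fill them.
class UpstreamPassSketch {

    // examplesPerOperator: operator name -> (equivalence class name -> examples),
    // as left behind by the downstream pass.
    void run(List<String> operatorsFromSinkToSource,
             Map<String, Map<String, List<Object>>> examplesPerOperator) {

        for (String operator : operatorsFromSinkToSource) {
            Map<String, List<Object>> classes = examplesPerOperator.get(operator);
            for (Map.Entry<String, List<Object>> entry : classes.entrySet()) {
                if (!entry.getValue().isEmpty()) {
                    continue;  // this equivalence class is already covered
                }
                // 1. build a synthetic input that would fall into the empty class,
                //    e.g. <Don't care, Join key, Don't care> for Join's E_1
                Object synthetic = buildSyntheticRecord(operator, entry.getKey());
                // 2. propagate it upstream to the LOAD operators, replace it there with a
                //    closely matching unused source record, re-evaluate the affected
                //    operators and update the lineage groups (omitted in this sketch)
                propagateToSourceAndRealize(operator, synthetic);
            }
        }
    }

    private Object buildSyntheticRecord(String operator, String equivalenceClass) {
        return new Object();  // placeholder
    }

    private void propagateToSourceAndRealize(String operator, Object synthetic) {
        // placeholder
    }
}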

After tackling the incompleteness issue in this upstream pass, we now move towards making the result set as concise as possible by pruning the generated examples. This is done in our last pass, the pruning pass.

Pruning Pass

In this pass, given the set of examples generated for each operator by the above two passes, we reduce the example set of every operator where possible, such that it does not affect the completeness of that operator. This is done using the concepts of record lineage and lineage groups that were set in the downstream and upstream passes. We start from the sink operator, consider its example set, and try removing one example at a time. If the removed example keeps the operator's completeness intact, we use the record's lineage group and attempt to remove the trace of that record throughout the tree. While pruning, we make sure that completeness is not compromised at any operator in the tree. We then move upstream and continue in the same fashion at each operator, making sure that removing a record at an upstream operator does not impact the completeness of that operator or of its downstream operators, until the source operators are reached. Let us now explain this concept and how it is realized with respect to the examples from Figures 4.4 and 4.7, starting with the example from Figure 4.4 (recreated here for reference).

Figure 4.8: Pruning pass on the example from Figure 4.4: initial state

Here the sink operator is Project; we consider its output and see two examples. Project is an operator where every output record is assigned to a single equivalence class E_1, so to maintain Project's completeness, having just one record in E_1 suffices. This allows us to prune one of the two examples from Project's output set {<Hung, facebook, 4>, <Amit, facebook, 4>}.

Let us try to prune the record <Hung, facebook, 4>. This record is part of two lineage groups, as follows:

Lineage Group 1:
<Hung, facebook, 4pm>
{Load, <Hung, facebook, 4pm>}
{Join, <<Hung, facebook, 4pm>, <facebook, 4>>}
{Filter, <<Hung, facebook, 4pm>, <facebook, 4>>}
{Project, <Hung, facebook, 4>}

Lineage Group 2:
<facebook, 4>
{Load, <facebook, 4>}
{Join, <<Hung, facebook, 4pm>, <facebook, 4>>}
{Join, <<Amit, facebook, 5pm>, <facebook, 4>>}
{Filter, <<Hung, facebook, 4pm>, <facebook, 4>>}
{Filter, <<Amit, facebook, 5pm>, <facebook, 4>>}
{Project, <Hung, facebook, 4>}
{Project, <Amit, facebook, 4>}

Considering Lineage Group 1 as we move from sink to source, we see that at Project, eliminating <Hung, facebook, 4> does not affect its completeness, and thus we are allowed to prune the example at the sink Project operator. The next upstream operator is Filter; here too, removing <<Hung, facebook, 4pm>, <facebook, 4>> has no impact on Filter's completeness, as both of its equivalence classes E_1 (passing the filter: <<Amit, facebook, 5pm>, <facebook, 4>>) and E_2 (failing the filter: <<Amit, youtube, 3pm>, <youtube, 2>>) are still filled, and Project's completeness is intact as well. Thus <<Hung, facebook, 4pm>, <facebook, 4>> is pruned at the Filter operator. The same holds at the Join and Load (UserVisits) operators, and the respective records are pruned there as well. We have hence pruned the complete trace of <Hung, facebook, 4> throughout the operator tree using Lineage Group 1. After this first round of pruning, Lineage Group 2 looks as follows:

<facebook, 4>
{Load, <facebook, 4>}
{Join, <<Hung, facebook, 4pm>, <facebook, 4>>}   [pruned]
{Join, <<Amit, facebook, 5pm>, <facebook, 4>>}
{Filter, <<Hung, facebook, 4pm>, <facebook, 4>>}   [pruned]
{Filter, <<Amit, facebook, 5pm>, <facebook, 4>>}
{Project, <Hung, facebook, 4>}   [pruned]
{Project, <Amit, facebook, 4>}

As the remaining records at Join, Filter and Project are not related to <Hung, facebook, 4>, we cannot prune them. We are, however, allowed to prune <facebook, 4> at Load; we attempt to remove it but realize that this would disturb the completeness of the downstream Filter operator, as there would no longer be any record passing the filter, which would make E_1 void. Hence <facebook, 4> is retained at Load (PageRanks). After finishing the pruning at the sink operator (Project), the resulting operator tree with examples looks as in Figure 4.9; we then move to the next upstream operator, Filter.

Figure 4.9: Pruning pass: after pruning at the Project operator

At Filter, it is not possible to prune the only available record from its output, as doing so would make Filter's, as well as Project's, result set empty. Moving further upstream to the Join operator, the output set here consists of two records, <<Amit, facebook, 5pm>, <facebook, 4>> and <<Amit, youtube, 3pm>, <youtube, 2>>.

At Join it would be possible to prune either one of these records without affecting Join's completeness, but pruning either of them would reduce completeness at the downstream Filter operator: pruning <<Amit, facebook, 5pm>, <facebook, 4>> would make Filter's E_1 (passing records) void, and pruning <<Amit, youtube, 3pm>, <youtube, 2>> would make Filter's E_2 (failing records) void. Hence at Join, as at Filter, no record is pruned. Applying the same logic to the remaining upstream operators Load (UserVisits) and Load (PageRanks), at Load (UserVisits) we can prune <June, gmail, 5pm>, and at Load (PageRanks) we can prune <yahoo, 3> and <skysports, 2>, as this does not upset the completeness of the respective Loads or of any downstream operator. The overall operator tree with examples after completion of the pruning pass is shown in Figure 4.10.

Figure 4.10: Pruning pass: complete pruning of the example from Figure 4.4

Using the same technique, the pruning pass on the example from Figure 4.7 produces the result set shown in Figure 4.11. The complete pruning pass can be summarized by the pseudo code in Algorithm 6.

Figure 4.11: Pruning pass: complete pruning of the example from Figure 4.7

Algorithm 6 Pruning Pass
Step 1: let O be the operator under consideration (initially O is the sink operator); for every record R in O's result set perform Step 2
Step 2: verify whether pruning R would result in incompleteness at O; if completeness stays intact proceed to Step 3, else get the next R (if any) from O's result set and redo Step 2
Step 3: get R's lineage group l; for every lineage entry in l execute Step 4
Step 4: get the record R_{l,i} (the record at operator i in lineage group l), where initially i is the farthest downstream operator; verify whether pruning R_{l,i} affects completeness at operator i as well as the completeness of i's downstream operators (if any). If completeness is intact everywhere, remove R_{l,i} from operator i, move to the next lineage entry record R_{l,i-1} (if any), where i-1 is the upstream operator of operator i, and redo Step 4 (now with R_{l,i} = R_{l,i-1})
Step 5: repeat Steps 1 through 4 for the next upstream operator until the source is reached
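The per-operator test at the heart of Steps 2 and 4 can be illustrated with the following small Java sketch (hypothetical types, illustration only): a record is removable only if no equivalence class would be left empty, and the same test is repeated at every downstream operator listed in the record's lineage group before any deletion actually happens.

import java.util.List;
import java.util.Map;

// Hypothetical sketch of the pruning test applied in Steps 2 and 4 of Algorithm 6.
class PruningPassSketch {

    // A record may be removed from an operator only if no equivalence class of that operator
    // would become empty. The same test is repeated for every downstream operator listed in
    // the record's lineage group before anything is actually deleted.
    static boolean canPrune(Object record, Map<String, List<Object>> equivalenceClasses) {
        for (List<Object> examples : equivalenceClasses.values()) {
            boolean onlyExampleInClass = examples.size() == 1 && examples.contains(record);
            if (onlyExampleInClass) {
                return false;  // removing it would violate completeness
            }
        }
        return true;
    }
}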

These were the main passes of our example generation algorithm. In the next section, we explain the extensible features of Flink used in our implementation.

Flink Add-ons

Using Semantic Properties

Flink's optimizer provides a way to inspect UDFs, normally seen as black boxes, by exposing information such as the positions of the input tuple fields that are accessed, modified or copied, hence providing indirect access to the data movement inside a UDF. To access this information, one needs to add function annotations 2 to a UDF. This can be done in a manner similar to the one shown in Listing 4.1:

@ForwardedFields("f0.f0->f0;f0.f1->f1;f1.f1->f2")
public static class PrintResultWithAnnotation implements
        FlatMapFunction<Tuple2<Tuple2<String, String>, Tuple2<String, Long>>,
                        Tuple3<String, String, Long>> {

    // Merges the tuples to create tuples of <User, URL, PageRank>
    public void flatMap(Tuple2<Tuple2<String, String>, Tuple2<String, Long>> joinset,
                        Collector<Tuple3<String, String, Long>> collector) throws Exception {
        collector.collect(new Tuple3<String, String, Long>(
                joinset.f0.f0, joinset.f0.f1, joinset.f1.f1));
    }
}

Listing 4.1: Demonstrating ForwardedFields usage

The class PrintResultWithAnnotation, a FlatMap UDF operator, takes an input tuple of the format <<String, String>, <String, Long>>, corresponding to <<User, URL>, <URL, PageRank>>, and returns a new tuple of the format <String, String, Long>, corresponding to <User, URL, PageRank>. If we try to paint a picture of what is going on inside the FlatMap function, it can be viewed as shown in Figure 4.12.

2 programming_guide.html#semantic-annotations

Figure 4.12: Inside view of the FlatMap function from Listing 4.1

This is exactly what is expressed by the ForwardedFields annotation that can be seen just before the class definition starts. It is stated as f0.f0 -> f0; f0.f1 -> f1; f1.f1 -> f2, which maps field positions of the input tuple to field positions of the output tuple:

input tuple:  << User (f0.f0), URL (f0.f1) >, < URL (f1.f0), PageRank (f1.f1) >>
output tuple: < User (f0), URL (f1), PageRank (f2) >

To explain this annotation in detail, we need to understand the input tuple structure <<User, URL>, <URL, PageRank>>. Overall, this is a Tuple2 representation, where field0 is <User, URL> and field1 is <URL, PageRank>; let us for simplicity call them main-field0 and main-field1. These main fields, both 0 and 1, are in turn Tuple2 representations: for <User, URL>, User is field0 and URL is field1; similarly for <URL, PageRank>, URL is field0 and PageRank is field1. In the expression f0.f1 -> f1, the left-hand side f0.f1 denotes URL, which is field1 of main-field0, and it is simply copied, or forwarded, to f1, i.e., the location field1 in the output tuple <User, URL, PageRank>. This demonstrates how function annotations allow us to drill into a UDF's internal operation. In our implementation, we have used and exploited this feature of Flink for the FlatMap UDF operator. This can be further extended to other UDFs, given that the annotation information is available.

Static Code Analysis (SCA) [23] is a new feature in the latest Flink release. SCA implicitly generates the same type of semantic information that we gain using function annotations. The user can thus avoid explicitly specifying annotations in order to expose the UDF behavior. Our implementation could further leverage

this additional feature of static code analysis, as we would then be exposed to the input-output mapping of all UDF operators. We could then extend our implementation to further UDFs, which can be considered a future enhancement.

Interactive Scala Shell

The Interactive Scala Shell for Flink is the latest addition to the Flink feature set; it allows command-line execution of Flink Scala code, evaluating it on a per-line basis. It provides a very good platform for our implementation to display the example result sets generated by the algorithm for a given dataflow program. We have therefore added an extra command, :ILLUSTRATE, to the existing list of standard shell commands; upon receiving this command, we call our flink-illustrator module and display the sample examples in the shell. To use this command, we need to provide two arguments, namely Flink's execution environment and the sink operator, or the sink operator's dataset. As our implementation is limited to Flink Java jobs, to execute it from the Scala shell we have to make the necessary adjustments and use the respective Java API modules. Appendix A.2 contains a sample Scala program adapted to the Java API modules and the usage of the ILLUSTRATE command.

In this chapter, we have explained the details of our algorithm implementation as well as the necessary extensions that one can use from Flink. The next chapter describes the experiments and evaluations conducted on the algorithm.

Chapter 5

Evaluation

This chapter encompasses one of the main objectives of this thesis, i.e., to evaluate the performance of the algorithm with respect to the following key metrics:

Completeness: verifies that the examples generated at an operator cover all the semantics of that operator.
Conciseness: verifies that the example set is as small as possible, to allow a quick overview of the operator functionality and the dataflow.
Realism: verifies that the examples indeed stem from the input sources.
Runtime: verifies that the example generation is indeed faster than executing the complete dataflow.

In the following sections, we explain the definition of each metric and the experimental setup and workload (the dataflow programs) used, and finally we measure the performance of our algorithm (with respect to the above metrics) against two other example generation approaches:

Downstream Propagation: a single downstream pass from the sources to the sink.
Sink-Pruning: pruning only at the final sink operator.

Performance Metrics

Completeness

The completeness of an operator O is defined using the concept of equivalence classes. It is the ratio of the number of equivalence classes with at least one example to the total number of equivalence classes of the given operator. This can be formulated as:

Completeness_O = (# of O's equivalence classes with at least one example) / (# of O's equivalence classes)

[This value always ranges between 0 and 1; 1 denotes total completeness.]

For example, the Filter operator has two equivalence classes for its input records, namely:
E_1 - for records passing the filtering predicate
E_2 - for records failing the filtering predicate

Figure 5.1: Completeness of Filter = 0.5

Consider a scenario where Filter's input only has records that pass the filter, as shown in Figure 5.1, where the operator accepts records with an age above 18. From the available set of examples, E_1 is filled and E_2 is empty, i.e., only one out of two equivalence classes has at least one example; substituting these values in the above formula we get 1/2.

Using the above technique, we define the per-operator completeness of all operators in a given dataflow program. The overall completeness of a dataflow program P can then be defined as the average of the per-operator completeness of all operators in P:

Completeness_P = (Completeness_O1 + Completeness_O2 + ... + Completeness_Om) / m
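Continuing the earlier sketches, per-operator and per-program completeness could be computed directly from the equivalence-class buckets, as in the following hypothetical Java snippet (illustration only, not the evaluation code):

// Hypothetical sketch: completeness of one operator and of a whole dataflow program.
class CompletenessMetric {

    // ratio of non-empty equivalence classes to all equivalence classes (always in [0, 1])
    static double ofOperator(int totalClasses, int nonEmptyClasses) {
        return totalClasses == 0 ? 1.0 : (double) nonEmptyClasses / totalClasses;
    }

    // overall completeness of a program: the average of the per-operator values
    static double ofProgram(double[] perOperatorCompleteness) {
        double sum = 0.0;
        for (double c : perOperatorCompleteness) {
            sum += c;
        }
        return sum / perOperatorCompleteness.length;
    }
}

For the Filter in Figure 5.1, ofOperator(2, 1) yields 0.5, matching the value in the caption.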

Conciseness

The conciseness of an operator O is defined as the ratio of the number of the operator's equivalence classes to the total number of example records for that operator (with a ceiling at 1):

Conciseness_O = (# of O's equivalence classes) / (# of O's example records)

[This value always ranges between 0 and 1; 1 denotes that the operator has produced the most concise set of examples.]

Using the example from Figure 5.1, Filter generates two examples and has two equivalence classes, hence the conciseness of this operator is 2/2 = 1, which means that the records generated are indeed concise, although not complete. Given the above method to evaluate per-operator conciseness, the overall conciseness of a dataflow program P can be defined as the average of the per-operator conciseness of all operators in P:

Conciseness_P = (Conciseness_O1 + Conciseness_O2 + ... + Conciseness_Om) / m

Realism

The realism of an operator O is defined as the fraction of records that are real over the total number of records generated by that operator. A record is considered real if it comes from the source operators or is derived via the different operations from the source operators. An operator may or may not contain synthetic records in its output set; having such synthetic records reduces realism.

Realism_O = (# of real records in O's output example set) / (# of records in O's output example set)

[This value always ranges between 0 and 1.]

Similarly, the overall realism of a dataflow program P is:

Realism_P = (Realism_O1 + Realism_O2 + ... + Realism_Om) / m

Considering the same example as in Figure 5.1, the realism at Filter is 1, as no synthetic records exist. Suppose that during the upstream pass a synthetic record <PQR, 17> is added to Filter's input to boost its completeness, as shown in Figure 5.2.

Figure 5.2: Realism of Filter after boosting completeness = 2/3

Due to the addition of the synthetic record, we indeed increase the completeness of the Filter operator, but at the expense of a reduction in realism. Given the nature of this problem, there is no solution that can guarantee full completeness together with full realism; one of them is compromised. A similar scenario may appear in a real-world dataflow program, and one cannot guarantee a score of 1 on each of completeness, conciseness and realism. Our algorithm favors completeness over realism, as it is more important to determine the complete behavior of an operator.

5.2 Experiments Setup and Datasets

In this section, we evaluate our dataflow programs with respect to the above performance metrics. Before diving into the details, let us briefly explain the experimental environment used. The experiments were conducted on a local Linux box running Ubuntu with 4 cores and 12 GB of RAM. Apache Flink was used for our implementation. Most of our experiments were conducted with local text files as input sources, while some used HDFS as the storage layer with Hadoop. We use datasets constructed locally to test our implementation; they cover all the supported operators with different scenarios for complete algorithm coverage.

We try to introduce empty equivalence classes by manipulating the input files accordingly; this allows us to verify the completeness aspect of the operators and of the overall dataflow program. The dataset details can be seen in Table 5.1.

DataSet        Details                                             Format                                               #rows
UserVisits     Log of users' visits to websites during the day     csv and hdfs file (Username, URL, Visit TimeStamp)
UserVisitsEU   Log of EU users' visits to websites during the day  csv and hdfs file (Username, URL, Visit TimeStamp)
PageRanks      Contains the pagerank of the respective website     csv (URL, Rank)                                      50
JobSeeker      Contains user details                               csv (UserId, Username)                               50
Jobs           Contains job and salary details                     csv (JobId, City, Salary)                            20
JobsEU         Contains job and salary details in the EU           csv (JobId, City, Salary)                            20

Table 5.1: Locally created datasets used for evaluation purposes

For a thorough evaluation, we define in total 13 dataflow programs that use variations of the above datasets as input. These dataflow programs can be found in Appendix A.3. We categorize these 13 programs into the following types (based on the type of evaluation):

1. Single-operator: the dataflow contains only one data transformation operator (Programs 1 to 6 and 13).
2. Multi-operator: the dataflow contains multiple data transformation operators (Programs 7 to 9).
3. Empty equivalence classes: the dataflow inputs are manipulated such that they result in an empty set or empty equivalence classes for the operator under test (Programs 10 to 12).

Quality of Generated Examples

We evaluate the above programs using three different approaches against the metrics defined, i.e., completeness, conciseness and realism. The approaches used are:

1. Our Algorithm: the algorithm from our implementation, with the downstream, upstream and full pruning passes (Section 4.4).
2. Downstream: in this approach we propagate the sample example sets from the source LOAD operators to the sink, evaluating at each operator in the operator tree. This approach allows us to validate the completeness metric.

3. Sink-level Pruning: this is a variant of our algorithm with pruning limited to the sink operator only, rather than applied at each operator in the operator tree. This approach allows us to validate the conciseness metric.

We ran three iterations of each program (in order to compensate for the initial sampling); in Figures 5.3, 5.4 and 5.5 the Y-axis therefore denotes the average of the metrics, while the X-axis denotes the algorithm types. The idea behind multiple executions, and thus evaluations, of each program is that the initial sampling at LOAD might produce examples that cannot always be consumed by the downstream operators to generate an output, resulting in incompleteness. For example, two different LOADs may produce examples with no common join key for the subsequent JOIN, or a LOAD may produce no example that passes the filtering predicate at FILTER. Having multiple runs thus helped us verify whether our algorithm is efficient enough to tackle any incompleteness being introduced. Let us analyze the experimental observations.

1. Single-operator dataflows: With respect to the completeness of the dataflows containing only one data transformation operator, our algorithm achieves the highest score of 1 for all 7 programs, i.e., the algorithm illustrates all the necessary semantics of the operators. In contrast, the Downstream approach fails to attain total completeness on more than one occasion: at Join (due to the unavailability of a common join key) and at Filter (due to a missing record that passes/fails the filtering predicate). Our algorithm managed to tackle the incompleteness introduced in these cases. As stated earlier, we emphasize completeness over realism; from Figure 5.3 (a/e/f) it is clear that we introduce a synthetic record in order to address Join's incompleteness, thereby compromising on realism. Compared to the other approaches, the examples generated by our algorithm are more concise and attain the highest score, whereas Downstream and Sink-Pruning score noticeably lower. Thus, it is quite clear that our algorithm produces the most complete and concise set of examples for the single-operator dataflow validation.

Figure 5.3: Evaluation of single-operator dataflows: (a) Join, (b) Union, (c) Cross, (d) Filter, (e) Join with Project, (f) Join with Project using Annotations, (g) Join using HDFS

2. Multi-operator dataflows:

Figure 5.4: Evaluation of multi-operator dataflows: (a) Union, Join, Project, (b) Union, Filter, Join, Project, (c) Union, Cross

As seen in Figure 5.4 for the multi-operator dataflows, our algorithm attains total completeness in 2 out of 3 programs. Program 8 being a multi-operator dataflow with FILTER, the upstream pass may not always succeed in filling the empty equivalence class. This can be rectified by running multiple iterations until the equivalence class is filled, but at the expense of the extra overhead of a longer running time as well as an increased cost for pruning (each iteration might add redundant records). Programs 7 and 8, containing UNION followed by JOIN in the operator tree, result in more than one joined example in some scenarios and hence a lack of conciseness. That said, the conciseness is still higher than with the other two approaches. For multi-operator dataflows, the set of examples produced by our algorithm may not always be totally complete, but it indeed fares better than the downstream-only approach.

3. Empty equivalence classes:

Figure 5.5: Evaluation of the special cases with intentional manipulation of the inputs to create empty equivalence classes at (a) Join, (b) Union, (c) Cross

For the special cases of Programs 10-12, the input sources were intentionally altered or kept empty so that they would result in incompleteness during the downstream pass. The algorithm still managed to achieve total completeness, at the expense of zero realism, because only synthetic records traversed the dataflow. As we can see in Figure 5.5, compared to our implemented algorithm, the downstream approach achieved very low scores on completeness, as there were no records in the input sources to completely perform either the join (in 10), the union (in 11) or the cross (in 12). Again, our approach managed to produce the most concise set of examples, with the highest score of 1 on every occasion.

To summarize the evaluation of the quality of the generated examples, our algorithm achieves total completeness on 12 out of 13 programs. A lack of realism denotes the fact that the downstream pass resulted in incompleteness and that, irrespective of this, the algorithm achieved the highest completeness. The examples generated are always concise and achieved a score of 1 in 11 out of 13 programs. The conciseness is higher than with the other two approaches because the Downstream approach does not perform any pruning, whereas in Sink-Pruning it is limited only to the sink operator.


Hadoop is supplemented by an ecosystem of open source projects IBM Corporation. How to Analyze Large Data Sets in Hadoop Hadoop Open Source Projects Hadoop is supplemented by an ecosystem of open source projects Oozie 25 How to Analyze Large Data Sets in Hadoop Although the Hadoop framework is implemented in Java, MapReduce

More information

Putting it together. Data-Parallel Computation. Ex: Word count using partial aggregation. Big Data Processing. COS 418: Distributed Systems Lecture 21

Putting it together. Data-Parallel Computation. Ex: Word count using partial aggregation. Big Data Processing. COS 418: Distributed Systems Lecture 21 Big Processing -Parallel Computation COS 418: Distributed Systems Lecture 21 Michael Freedman 2 Ex: Word count using partial aggregation Putting it together 1. Compute word counts from individual files

More information

The Pig Experience. A. Gates et al., VLDB 2009

The Pig Experience. A. Gates et al., VLDB 2009 The Pig Experience A. Gates et al., VLDB 2009 Why not Map-Reduce? Does not directly support complex N-Step dataflows All operations have to be expressed using MR primitives Lacks explicit support for processing

More information

Pig Latin Reference Manual 1

Pig Latin Reference Manual 1 Table of contents 1 Overview.2 2 Pig Latin Statements. 2 3 Multi-Query Execution 5 4 Specialized Joins..10 5 Optimization Rules. 13 6 Memory Management15 7 Zebra Integration..15 1. Overview Use this manual

More information

Big Data Hadoop Course Content

Big Data Hadoop Course Content Big Data Hadoop Course Content Topics covered in the training Introduction to Linux and Big Data Virtual Machine ( VM) Introduction/ Installation of VirtualBox and the Big Data VM Introduction to Linux

More information

Hadoop & Big Data Analytics Complete Practical & Real-time Training

Hadoop & Big Data Analytics Complete Practical & Real-time Training An ISO Certified Training Institute A Unit of Sequelgate Innovative Technologies Pvt. Ltd. www.sqlschool.com Hadoop & Big Data Analytics Complete Practical & Real-time Training Mode : Instructor Led LIVE

More information

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad

More information

April Copyright 2013 Cloudera Inc. All rights reserved.

April Copyright 2013 Cloudera Inc. All rights reserved. Hadoop Beyond Batch: Real-time Workloads, SQL-on- Hadoop, and the Virtual EDW Headline Goes Here Marcel Kornacker marcel@cloudera.com Speaker Name or Subhead Goes Here April 2014 Analytic Workloads on

More information

A Comparative study of Clustering Algorithms using MapReduce in Hadoop

A Comparative study of Clustering Algorithms using MapReduce in Hadoop A Comparative study of Clustering Algorithms using MapReduce in Hadoop Dweepna Garg 1, Khushboo Trivedi 2, B.B.Panchal 3 1 Department of Computer Science and Engineering, Parul Institute of Engineering

More information

Machine learning library for Apache Flink

Machine learning library for Apache Flink Machine learning library for Apache Flink MTP Mid Term Report submitted to Indian Institute of Technology Mandi for partial fulfillment of the degree of B. Tech. by Devang Bacharwar (B2059) under the guidance

More information

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce Parallel Programming Principle and Practice Lecture 10 Big Data Processing with MapReduce Outline MapReduce Programming Model MapReduce Examples Hadoop 2 Incredible Things That Happen Every Minute On The

More information

Part 1: Indexes for Big Data

Part 1: Indexes for Big Data JethroData Making Interactive BI for Big Data a Reality Technical White Paper This white paper explains how JethroData can help you achieve a truly interactive interactive response time for BI on big data,

More information

Distributed Systems. 21. Graph Computing Frameworks. Paul Krzyzanowski. Rutgers University. Fall 2016

Distributed Systems. 21. Graph Computing Frameworks. Paul Krzyzanowski. Rutgers University. Fall 2016 Distributed Systems 21. Graph Computing Frameworks Paul Krzyzanowski Rutgers University Fall 2016 November 21, 2016 2014-2016 Paul Krzyzanowski 1 Can we make MapReduce easier? November 21, 2016 2014-2016

More information

Parallelizing Windows Operating System Services Job Flows

Parallelizing Windows Operating System Services Job Flows ABSTRACT SESUG Paper PSA-126-2017 Parallelizing Windows Operating System Services Job Flows David Kratz, D-Wise Technologies Inc. SAS Job flows created by Windows operating system services have a problem:

More information

Certified Big Data and Hadoop Course Curriculum

Certified Big Data and Hadoop Course Curriculum Certified Big Data and Hadoop Course Curriculum The Certified Big Data and Hadoop course by DataFlair is a perfect blend of in-depth theoretical knowledge and strong practical skills via implementation

More information

data parallelism Chris Olston Yahoo! Research

data parallelism Chris Olston Yahoo! Research data parallelism Chris Olston Yahoo! Research set-oriented computation data management operations tend to be set-oriented, e.g.: apply f() to each member of a set compute intersection of two sets easy

More information

Master s Thesis. Measuring Code Coverage on an Embedded Target with Highly Limited Resources

Master s Thesis. Measuring Code Coverage on an Embedded Target with Highly Limited Resources Graz University of Technology Institute for Applied Information Processing and Communications Master s Thesis Measuring Code Coverage on an Embedded Target with Highly Limited Resources Philipp Pani Graz,

More information

Pig on Spark project proposes to add Spark as an execution engine option for Pig, similar to current options of MapReduce and Tez.

Pig on Spark project proposes to add Spark as an execution engine option for Pig, similar to current options of MapReduce and Tez. Pig on Spark Mohit Sabharwal and Xuefu Zhang, 06/30/2015 Objective The initial patch of Pig on Spark feature was delivered by Sigmoid Analytics in September 2014. Since then, there has been effort by a

More information

Contents Contents Introduction Basic Steps in Query Processing Introduction Transformation of Relational Expressions...

Contents Contents Introduction Basic Steps in Query Processing Introduction Transformation of Relational Expressions... Contents Contents...283 Introduction...283 Basic Steps in Query Processing...284 Introduction...285 Transformation of Relational Expressions...287 Equivalence Rules...289 Transformation Example: Pushing

More information

Hadoop course content

Hadoop course content course content COURSE DETAILS 1. In-detail explanation on the concepts of HDFS & MapReduce frameworks 2. What is 2.X Architecture & How to set up Cluster 3. How to write complex MapReduce Programs 4. In-detail

More information

Distributed Computation Models

Distributed Computation Models Distributed Computation Models SWE 622, Spring 2017 Distributed Software Engineering Some slides ack: Jeff Dean HW4 Recap https://b.socrative.com/ Class: SWE622 2 Review Replicating state machines Case

More information

Apache Flink. Fuchkina Ekaterina with Material from Andreas Kunft -TU Berlin / DIMA; dataartisans slides

Apache Flink. Fuchkina Ekaterina with Material from Andreas Kunft -TU Berlin / DIMA; dataartisans slides Apache Flink Fuchkina Ekaterina with Material from Andreas Kunft -TU Berlin / DIMA; dataartisans slides What is Apache Flink Massive parallel data flow engine with unified batch-and streamprocessing CEP

More information

Massive Online Analysis - Storm,Spark

Massive Online Analysis - Storm,Spark Massive Online Analysis - Storm,Spark presentation by R. Kishore Kumar Research Scholar Department of Computer Science & Engineering Indian Institute of Technology, Kharagpur Kharagpur-721302, India (R

More information

Certified Big Data Hadoop and Spark Scala Course Curriculum

Certified Big Data Hadoop and Spark Scala Course Curriculum Certified Big Data Hadoop and Spark Scala Course Curriculum The Certified Big Data Hadoop and Spark Scala course by DataFlair is a perfect blend of indepth theoretical knowledge and strong practical skills

More information

Load Balancing for Entity Matching over Big Data using Sorted Neighborhood

Load Balancing for Entity Matching over Big Data using Sorted Neighborhood San Jose State University SJSU ScholarWorks Master's Projects Master's Theses and Graduate Research Fall 2015 Load Balancing for Entity Matching over Big Data using Sorted Neighborhood Yogesh Wattamwar

More information

GLADE: A Scalable Framework for Efficient Analytics. Florin Rusu (University of California, Merced) Alin Dobra (University of Florida)

GLADE: A Scalable Framework for Efficient Analytics. Florin Rusu (University of California, Merced) Alin Dobra (University of Florida) DE: A Scalable Framework for Efficient Analytics Florin Rusu (University of California, Merced) Alin Dobra (University of Florida) Big Data Analytics Big Data Storage is cheap ($100 for 1TB disk) Everything

More information

Accelerating BI on Hadoop: Full-Scan, Cubes or Indexes?

Accelerating BI on Hadoop: Full-Scan, Cubes or Indexes? White Paper Accelerating BI on Hadoop: Full-Scan, Cubes or Indexes? How to Accelerate BI on Hadoop: Cubes or Indexes? Why not both? 1 +1(844)384-3844 INFO@JETHRO.IO Overview Organizations are storing more

More information

CS-2510 COMPUTER OPERATING SYSTEMS

CS-2510 COMPUTER OPERATING SYSTEMS CS-2510 COMPUTER OPERATING SYSTEMS Cloud Computing MAPREDUCE Dr. Taieb Znati Computer Science Department University of Pittsburgh MAPREDUCE Programming Model Scaling Data Intensive Application Functional

More information

Real-time data processing with Apache Flink

Real-time data processing with Apache Flink Real-time data processing with Apache Flink Gyula Fóra gyfora@apache.org Flink committer Swedish ICT Stream processing Data stream: Infinite sequence of data arriving in a continuous fashion. Stream processing:

More information

Cloud Computing 3. CSCI 4850/5850 High-Performance Computing Spring 2018

Cloud Computing 3. CSCI 4850/5850 High-Performance Computing Spring 2018 Cloud Computing 3 CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University Learning

More information

Big Data Syllabus. Understanding big data and Hadoop. Limitations and Solutions of existing Data Analytics Architecture

Big Data Syllabus. Understanding big data and Hadoop. Limitations and Solutions of existing Data Analytics Architecture Big Data Syllabus Hadoop YARN Setup Programming in YARN framework j Understanding big data and Hadoop Big Data Limitations and Solutions of existing Data Analytics Architecture Hadoop Features Hadoop Ecosystem

More information

CHAPTER 4 ROUND ROBIN PARTITIONING

CHAPTER 4 ROUND ROBIN PARTITIONING 79 CHAPTER 4 ROUND ROBIN PARTITIONING 4.1 INTRODUCTION The Hadoop Distributed File System (HDFS) is constructed to store immensely colossal data sets accurately and to send those data sets at huge bandwidth

More information

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets 2011 Fourth International Symposium on Parallel Architectures, Algorithms and Programming PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets Tao Xiao Chunfeng Yuan Yihua Huang Department

More information

Data Partitioning and MapReduce

Data Partitioning and MapReduce Data Partitioning and MapReduce Krzysztof Dembczyński Intelligent Decision Support Systems Laboratory (IDSS) Poznań University of Technology, Poland Intelligent Decision Support Systems Master studies,

More information

Comparison of FP tree and Apriori Algorithm

Comparison of FP tree and Apriori Algorithm International Journal of Engineering Research and Development e-issn: 2278-067X, p-issn: 2278-800X, www.ijerd.com Volume 10, Issue 6 (June 2014), PP.78-82 Comparison of FP tree and Apriori Algorithm Prashasti

More information

Overview. Prerequisites. Course Outline. Course Outline :: Apache Spark Development::

Overview. Prerequisites. Course Outline. Course Outline :: Apache Spark Development:: Title Duration : Apache Spark Development : 4 days Overview Spark is a fast and general cluster computing system for Big Data. It provides high-level APIs in Scala, Java, Python, and R, and an optimized

More information

MapReduce Spark. Some slides are adapted from those of Jeff Dean and Matei Zaharia

MapReduce Spark. Some slides are adapted from those of Jeff Dean and Matei Zaharia MapReduce Spark Some slides are adapted from those of Jeff Dean and Matei Zaharia What have we learnt so far? Distributed storage systems consistency semantics protocols for fault tolerance Paxos, Raft,

More information

Lecture Notes to Big Data Management and Analytics Winter Term 2017/2018 Apache Flink

Lecture Notes to Big Data Management and Analytics Winter Term 2017/2018 Apache Flink Lecture Notes to Big Data Management and Analytics Winter Term 2017/2018 Apache Flink Matthias Schubert, Matthias Renz, Felix Borutta, Evgeniy Faerman, Christian Frey, Klaus Arthur Schmid, Daniyal Kazempour,

More information

1.2 Why Not Use SQL or Plain MapReduce?

1.2 Why Not Use SQL or Plain MapReduce? 1. Introduction The Pig system and the Pig Latin programming language were first proposed in 2008 in a top-tier database research conference: Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi

More information

CompSci 516: Database Systems

CompSci 516: Database Systems CompSci 516 Database Systems Lecture 12 Map-Reduce and Spark Instructor: Sudeepa Roy Duke CS, Fall 2017 CompSci 516: Database Systems 1 Announcements Practice midterm posted on sakai First prepare and

More information