Implementing the speed layer in a lambda architecture


Implementing the speed layer in a lambda architecture
IT4BI MSc Thesis
Student: ERICA BERTUGLI
Advisor: FERRAN GALÍ RENIU (TROVIT)
Supervisor: OSCAR ROMERO MORAL
Master on Information Technologies for Business Intelligence
Universitat Politècnica de Catalunya
Barcelona, July 31, 2016

A thesis presented by ERICA BERTUGLI in partial fulfillment of the requirements for the MSc degree on Information Technologies for Business Intelligence.

Abstract

With the increasing growth of big data and the need for near real-time analysis, the requirement emerged for a solution able to reconcile existing ETL batch-processing methods with newly developed stream-processing methodologies designed to obtain a view of online data. The lambda architecture [1] is a data-processing architecture that makes it possible to exploit both batch- and stream-processing methods in an attempt to balance latency, throughput, and accuracy. This project aims to demonstrate how the lambda architecture can be a good approach for a company to enable near real-time analytics by integrating the existing batch processes with a streaming solution.

Contents

List of Figures
List of Tables
1 Introduction
  1.1 Structure of the work
2 Current solution
  2.1 Data storage
    2.1.1 Batch view
    2.1.2 Data warehouse
  2.2 Data flow
  2.3 Limitations of the current solution
3 Related work
4 Design of the solution
  4.1 Requirements
  4.2 Architecture
  4.3 Messaging system
  4.4 Speed layer
  4.5 Batch layer
  4.6 Serving layer
5 Implementation
  5.1 Speed layer
    5.1.1 Technology
    5.1.2 Design
    5.1.3 Invalid clicks detection
  5.2 Serving layer
    5.2.1 Synchronization
    5.2.2 Dump of the materialized views in the data warehouse
  5.3 Exception handling
    5.3.1 Non-blocking exceptions
    5.3.2 Blocking exceptions
  5.4 Fault-tolerance
  5.5 Monitoring and guarantee of satisfaction of the requirements

6 Experiments and discussion
  6.1 Speed layer
  6.2 Serving layer
  6.3 Experiment results
  Interest of the user
  Comparison with current solution
7 Conclusion and future work
Appendices
A Logline
B Agile Methodology
C Additional views for the data warehouse
D Survey
  D.1 First survey
  D.2 Second survey
References

List of Figures

2.1 UML Class Diagram of loglines
2.2 Current solution
2.3 Batch flow 1
2.4 Batch flow 2
3.1 Lambda architecture from a high-level perspective [8]
3.2 Frameworks benchmarking (figure from [9])
4.1 Architecture proposed
4.2 Sync view in the serving layer
5.1 BPMN of the speed layer
5.2 Stateful transformation for invalid clicks detection
5.3 BPMN of the serving layer
6.1 Speed layer: experiments with ad impression loglines
6.2 Ad impressions - Adding cores
6.3 Ad impressions - Adding executors
6.4 Speed layer: experiments with click loglines
6.5 Clicks - Adding executors
6.6 Clicks - Adding memory
6.7 Experiments with view source revenues
6.8 View source revenues - adding cores
6.9 View source revenues - adding memory
6.10 View source revenues - adding executors
B.1 Weekly sprints
B.2 Product backlog
C.1 Experiments with view hit zone monthly
C.2 Experiments with view source revenues hourly

List of Tables

2.1 Characteristics of the data
2.2 Characteristics of the data flow
4.1 Project requirements
6.1 Experiment results
6.2 Comparison of the logline injection
6.3 Comparison of the serving layer

Chapter 1

Introduction

The company hosting the Master Thesis, Trovit [2], was founded in 2006 and has its headquarters in Barcelona. Trovit is a vertical search engine for classified ads, specialized in four verticals: jobs, cars, real estate and products. The search engine is currently available in 46 different countries, it has 140 million indexed ads and more than 93 million visits per month. The aim of the website is to make available to its users all the ads available on the web, aggregating several sources. Users can search the ads, refine the results using filters, set up regular alerts, or set up notifications in the mobile app. The ads displayed to the user are crawled from websites or obtained directly through partnerships with the sources. Trovit's business model is based on the following elements:

- pay per click: the partner website pays each time a user clicks on its ad (and is consequently redirected to the source website);
- pay per conversion: Trovit is paid each time a click is converted into a concrete action on the partner website, for example a contact request;
- Google AdSense: Trovit is paid for each advertisement displayed using AdSense technology;
- web banners: additional space for advertisements on third-party websites is sold to partner websites.

For this reason, it is extremely important for the company to always have accurate and current data about the clicks, the impressions, and all the events generated on the website and on the mobile application, in order to be able to correctly monetize them.

1.1 Structure of the work

The following chapters will describe the solution already existing in the company (Chapter 2) and the related work (Chapter 3). Chapter 4 will introduce the requirements of the project as well as the proposed solution with its architecture. Chapter 5 will detail the implementation, while Chapter 6 will describe and discuss the experiments performed. Lastly, Chapter 7 will present the conclusions of the work.

Chapter 2

Current solution

Currently Trovit has an ETL flow that processes the events coming from the web to produce internal statistics. The data involved are generically named loglines and are related to the analysis of the usage of the company website as well as of the mobile application and the alerts. Alerts are e-mails that are requested by the user in order to stay updated with new ads of interest. The process involves 15 ETL flows that process 15 different types of loglines respectively. The following are the types of events analyzed:

- ad impression: every impression on Trovit web pages.
- click: when a user goes to a partner's website through Trovit.
- internal click: clicks whose destination is Trovit itself (not the ones monetized).
- conversion tracking: information about clicks that have been converted into a concrete action on the partner website.
- banner impression: every banner Trovit shows on foreign websites.
- alert creation impression: every popup where a user is offered to create an alert.
- alert creation: every alert creation or modification.
- alert sending: the event of sending a specific alert to a specific user.
- alert opening: event generated when a user opens an alert.
- api: every use of the Trovit API.
- api error: every error generated by the Trovit API.
- app installation status: information about installation or uninstallation of the Trovit mobile app.
- app push status: settings of the notifications in the mobile app.
- app push send: every notification sent through the mobile app.
- app push open: every notification that has been opened by the user.

Figure 2.1 shows the UML class diagram representing the loglines, where the conceptual macrotypes of loglines are in green and the types of loglines actually implemented are in yellow (an illustrative sketch is given below). The existing ETL performs the following steps: it reads the loglines from a messaging system (Apache Kafka); it performs a cleaning process and then stores the loglines in a data storage named batch view; it performs aggregations and stores materialized views in a data warehouse. We name the first data storage the batch view because it doesn't contain pure raw data but data that have already been cleaned and de-duplicated, yet not aggregated. The architecture of the current solution is shown in figure 2.2. Data analysts can query the non-preaggregated data from the batch view or the aggregated data from the data warehouse; moreover, they can access dashboards built on the data stored in the data warehouse.
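As an illustration of the hierarchy in figure 2.1, the following Scala sketch models a few of the macrotypes and concrete logline types as a sealed trait with case classes. The names and fields are hypothetical and only meant to mirror the idea of macrotypes (green) versus implemented types (yellow); they are not Trovit's actual classes.

```scala
// Hypothetical sketch of the logline hierarchy: macrotypes as traits,
// implemented logline types as case classes. Field names are illustrative.
sealed trait Logline { def timestamp: Long; def country: String }

sealed trait ImpressionLogline extends Logline          // macrotype
sealed trait ClickLogline      extends Logline          // macrotype
sealed trait AlertLogline      extends Logline          // macrotype

case class AdImpression(timestamp: Long, country: String,
                        pageId: Long, adId: Long) extends ImpressionLogline

case class Click(timestamp: Long, country: String,
                 adId: Long, sourceId: Long,
                 valid: Boolean = true) extends ClickLogline

case class AlertSending(timestamp: Long, country: String,
                        alertId: Long, userId: Long) extends AlertLogline
```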

Figure 2.1: UML Class Diagram of loglines.
Figure 2.2: Current solution.

2.1 Data storage

Table 2.1 presents the different characteristics of the data stored in the batch view and in the data warehouse.

2.1.1 Batch view

The batch view has been built using Apache HDFS and it stores Apache Parquet files, used for quick access to the original data with SQL-like queries through tools like Hive and Impala. Parquet is an efficient column-oriented binary format designed for Apache Hadoop. Compared to other formats like Sequence Files or Apache Avro, which are row-based, Apache Parquet is more efficient when specific columns are queried and few joins are performed [3]. This is the reason why it has been chosen as the format for the batch view, where the stored loglines have several columns and most of the queries select only a few columns and are focused on one type of logline (thus no joins are performed). Moreover, Parquet allows efficient compression of the data. Another format that would be efficient for Trovit's use case is Apache ORC (Optimized Row Columnar); however, ORC is not fully supported by Cloudera Impala, a query engine used by Trovit as an alternative to Apache Hive, and there are benchmarks of ORC against Parquet showing that Parquet is more efficient [4]. However, to effectively choose the best format, new empirical tests should be performed in the future, but this goes beyond the scope of this thesis. Data stored in Parquet are made available to the users through Apache Hive, in particular with the creation of partitioned external tables.

Since most of the queries performed on the batch view analyze data of only a few days or a few months, tables are partitioned by day to allow fast queries that avoid full table scans. In fact, all the queries performed with Hive always filter on the partition column, named s_datestamp, using the pattern SELECT col_i1, col_i2, ..., col_ik FROM table_name WHERE s_datestamp IN (day1, day2, ..., dayN) AND additional filters (a sketch of such a query is shown at the end of section 2.1). An example of the Parquet format for the logline click is provided in appendix A.

Characteristic | Batch view | Data warehouse
Granularity | Logline | Different granularities in different views (day or month, vertical, country, etc.)
Format of the data | Parquet | Relational tables
Data storage | Apache HDFS | MySQL
Historicity (= range of data stored) | Since ever | Since ever
Data freshness (= time interval between when the data are produced and when they are inserted in the data storage) | 1 hour and 30 minutes (approximately) | 1 hour and 40 minutes (approximately)
How data are accessed | Apache Hive queries, Cloudera Impala queries | SQL queries, reporting tools

Table 2.1: Characteristics of the data

2.1.2 Data warehouse

The data warehouse is implemented on a relational database. The concept of data warehouse developed at Trovit is a set of several materialized views of aggregated data, analyzed through SQL queries. Currently the data warehouse consists of 41 tables, considering only the ones displaying internal statistics and created through the aggregation of the 15 Hive external tables previously mentioned. Because of the slow read performance of the relational database, expensive SQL operations between the different tables are avoided as much as possible; thus, every time new analyses are required and are forecast to be performed quite often, new tables are created or the old ones are updated. This is done with the awareness that a relational database is not the best choice for the implementation of Trovit's data warehouse and that in the future a new solution will be designed. Similarly, the concept of a data warehouse as a set of materialized views may in the future be replaced by a multidimensional model that would enable OLAP queries. A detailed study would be needed to analyze the limitations of using a relational database for Trovit's data warehouse and the possible alternatives. However, some evident limitations of using MySQL as a data warehouse are the following: it cannot scale out; it performs slow writes compared with some distributed storage systems; it cannot handle concurrent writes and reads. Generally the tables available in the data warehouse have a granularity of one day, one vertical and one country. Some of the tables also have other levels of granularity, for example a source, that is, a specific website from which ads are taken.
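The following sketch shows what such a partition-filtered query looks like when run from Spark with a HiveContext (Spark 1.x style). The table name and the selected columns other than s_datestamp are hypothetical placeholders, not the actual schema.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Minimal sketch: query the partitioned external Hive table backing the
// batch view, always filtering on the s_datestamp partition column so
// that only the requested daily partitions are scanned.
object BatchViewQuery {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("batch-view-query"))
    val hiveContext = new HiveContext(sc)

    // "click" and the selected columns are illustrative names.
    val df = hiveContext.sql(
      """SELECT source_id, country, COUNT(*) AS clicks
        |FROM click
        |WHERE s_datestamp IN ('2016-07-01', '2016-07-02')
        |GROUP BY source_id, country""".stripMargin)

    df.show()
  }
}
```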

2.2 Data flow

The ETL process consists of two sequential batch flows; this is due to a rearrangement of an old batch flow previously existing in Trovit. The first flow, represented in figure 2.3, is executed every 15 minutes: it collects the loglines from Apache Kafka, transforms them into JSON format and stores them in a temporary storage from which the data will be read by the second flow. The temporary storage is located in Apache HDFS. The second flow (figure 2.4) is executed every hour and performs the following steps:

1. Reading the loglines from the temporary storage.
2. De-duplicating the loglines read and splitting them into small files, ensuring that no loglines from two different days end up in the same file.
3. Compacting the loglines per day and performing some additional logic to add information to the final statistics (invalid click detection and out-of-budget detection).
4. Storing the data in the batch view. During this phase a new partition with the daily loglines is added to the related Hive external tables.
5. Aggregating the loglines per day and storing them in the data warehouse (overwriting the values written by the previous batch executions for the same day).

Figure 2.3: Batch flow 1.
Figure 2.4: Batch flow 2.

Steps 2, 3 and 4 are MapReduce jobs and read from HDFS; thus each of these steps writes data to disk twice, once after the map phase and once after the reduce phase. Step 5 uses Apache Sqoop, a tool able to perform data transfers between HDFS and MySQL. Sqoop is the most common tool used to transfer data from HDFS to a structured database; however, the internal implementation of Sqoop, based on MapReduce jobs, may be a bottleneck for the application because data are stored on disk after each map and reduce operation.

The batch process as described above makes data available in MySQL with a delay of approximately one hour and 40 minutes, due to the scheduling delay of the first flow (15 minutes) and of the second flow (one hour) and to the execution time of the full flow (approximately 30 minutes). Table 2.2 shows the characteristics of the data flow.

Characteristic | Data flow (ETL)
Latency (= the time interval between the beginning of the process and the update of the data warehouse) | Approx. 30 minutes
Data freshness (= the time interval between when the data are produced and when they are inserted in the data warehouse) | Approx. 1 hour and 40 minutes
Technology | Apache Hadoop MapReduce, Apache Sqoop
Modularity | The ETL process is able to handle the different types of loglines.

Table 2.2: Characteristics of the data flow

2.3 Limitations of the current solution

The main limitation of the process currently in place at Trovit is that both the batch view and the data warehouse are updated approximately every one hour and 30 minutes. This is mainly due to the fact that the whole process is implemented with Apache Hadoop MapReduce, whose bottleneck is the writing to disk after each phase. For this reason Trovit decided to explore a solution that could enable near real-time analytics, in particular with the following motivations:

- to allow data analysts to react on time to possible business changes;
- to allow developers to have an immediate overview of the results of changes made to the processes, correcting possible bugs with a small delay and activating automatic monitoring based on online data;
- to automate decision-making processes that help the company to better monetize the clicks received, according to the clicks registered up to that moment.

Regarding the last point, it is important to notice that one of the sources of revenue for Trovit is that partner websites pay each time a user clicks on one of their ads. However, the partners agree with Trovit on a daily budget that they are willing to pay. For this reason it is important to be able to detect almost in real time when a partner has already exhausted its daily budget, and to do that, to be able to detect which clicks must be invoiced (valid) and which must not. Currently Trovit calculates this information every hour with the batch flow; having this information in near real time would make it possible to detect immediately which source websites have no budget left and to automatically favor, in the result page, the ones that still have budget.

Chapter 3

Related work

In the past, different architectures have been studied to enable real-time analytics. However, most of the solutions proposed (e.g. the fully incremental architecture described in [5], the Kappa architecture [6] and the Zeta architecture [7]) have been discarded in favor of the lambda architecture, which is currently the most commonly adopted. The lambda architecture [1] is a data-processing architecture that makes it possible to exploit both batch- and stream-processing methods in an attempt to balance latency, throughput, and accuracy. It is an architecture design where a sequence of records is fed into a batch system (named batch layer) and a stream-processing system (named speed layer) in parallel, as shown in figure 3.1.

Figure 3.1: Lambda architecture from a high-level perspective [8].

The logic in the two layers may be slightly different because each layer aims to satisfy specific requirements. The batch layer manages the master dataset (an immutable, append-only set of raw data) and continuously creates batch views of the data. This is a high-latency operation, because it runs a function on all the data, and by the time the batch layer finishes, a lot of new data will have accumulated that is not represented in the batch views. For this reason, while the batch views are being computed, the speed layer takes care of creating real-time views of the data received in the meanwhile. The results are merged together in a phase named serving layer, where the actual output for the user is produced. Although many companies are currently using or implementing lambda architectures, it is quite difficult to find details about the design of their systems. For this reason, this section will focus on the technologies available to implement the system. While the batch layer already exists and is implemented using Hadoop MapReduce, the technologies for the speed layer and the serving layer have been chosen by the company and are, respectively, Apache Spark Streaming for the first and Apache Spark for the second.

However, this section will give an overview of the main technologies used for use cases with similar requirements. For the speed layer several streaming technologies are available; the most common are currently Apache Storm, Apache Spark and Apache Flink. Trovit decided not to use Storm because of the bugs and difficulties encountered when setting up and running other applications with this technology, and because it is not developing rapidly in comparison with Spark and Flink, which constantly have new releases and improvements. Moreover, Storm does not provide a framework for batch processing, so choosing it would have implied dealing with a second technology for the serving layer. Between Spark and Flink, the first has been chosen mostly because it is more mature and because of the support available. More details about the comparison between Spark and Flink are given in section 5.1.1. For the serving layer there were several candidate technologies: Hadoop MapReduce, Spark, Metis, and other frameworks available for batch processing. While MapReduce has been excluded for its bad performance in this specific use case (an explanation is available in section 4.6), Spark has been chosen by Trovit over the others because it performs better when data fit in memory, because it exploits data locality, and especially for its maturity with respect to the other technologies, for the available libraries that allow connecting to different destination storages, and because the same tool also offers a streaming solution, so it can be used for both the speed layer and the serving layer. Metis has been excluded because it does not scale well for input sizes greater than 1 GB, as can be clearly seen in the benchmark of figure 3.2, taken from [9].

Figure 3.2: Frameworks benchmarking (figure from [9]).

Chapter 4

Design of the solution

4.1 Requirements

Considering the limitations of the current solution described in the previous chapter, a list of the requirements that the final solution should satisfy has been drafted in table 4.1. In the definition of the requirements we will use the name sync view to refer to the data storage containing the non-aggregated loglines (same format as in the batch view of the current solution). The main goal of the design of a new solution is to have fresher data to analyze. In particular, it has been evaluated that a maximum freshness of 2 minutes for the loglines registered in the sync view and of 6 minutes for the aggregated view named source revenues stored in the data warehouse would be an optimal solution to overcome the limitations of the current system. The functional requirements shown in the table are mainly related to the technologies that were imposed by the company: Apache Kafka as input system, Apache HDFS for the sync view and MySQL for the data warehouse.

4.2 Architecture

Considering the requirements listed in the previous section, the solution proposed is a lambda architecture that exploits the current ETL flow as batch layer, adding a speed layer to provide data with high freshness. The architecture proposed, shown in figure 4.1, consists of four modules: an input messaging system, a batch layer, a speed layer, and a serving layer. Differently from the high-level definition of the lambda architecture described in chapter 3, in Trovit's case the lambda architecture will not have a master dataset. This is because raw data are not stored anywhere; instead, data are processed by the batch layer and stored in the batch view. This means that once the batch processing is complete, raw data are not accessible anymore. At the same time the speed layer processes data and writes them in a storage named speed view. In the serving layer the two views are synchronized and a unique view, the sync view, is shown to the user.

4.3 Messaging system

The messaging system collects the loglines from the website, from the mobile application and from the e-mail system. When the processes of the batch layer and the speed layer query the messaging system they receive an identical copy of the events. Trovit currently uses Apache Kafka as messaging system. The system has two brokers, and each event corresponding to a type of logline is stored in a specific topic (thus there are 15 topics, each of them with two partitions).

Functional requirements:

- Data input system: Apache Kafka.
- Data flow: the new data flow will have to perform the data de-duplication and the invalid click detection as in the current solution.
- Data storage: non-aggregated data must be stored in the sync view (HDFS), while aggregated data must be stored in MySQL as in the current solution.
- Data stored in Parquet: data must be inserted in the sync view as Apache Parquet files.
- Accessibility: data stored in the sync view must be accessible through Apache Hive and Cloudera Impala.
- Failure management: errors during the ETL process must be notified by e-mail. In case of non-blocking errors, data must be collected and periodically sent by e-mail (details about the types of errors are available in section 5.3). Number of e-mails sent per blocking error: 100%. Number of e-mails sent per non-blocking error: one the first time the error is seen and one if the error is repeated 100 times in one hour.

Non-functional requirements:

- Data freshness in sync view: 2 minutes.
- Data freshness for view source revenues: 6 minutes.
- Fault tolerance / Robustness: the system designed must be able to continue operating properly in the event of failure, and auto-restart if needed. Maximum number of times the system fails (blocking errors): twice a week. Minimum fraction of failures from which the system recovers automatically: 99%. Maximum number of non-blocking errors: one per hour.
- Horizontal scalability: the solution must be able to deal with increasing amounts of data just by adding additional computing nodes to the system. The cluster must be able to run the application even if the number of logs became 10 times bigger.
- Modularity: the solution proposed must be able to easily handle the different types of loglines existing, and possible new ones.
- Accuracy: each layer (speed layer or batch layer) should process the loglines at most once; acceptable error: there may be duplicate loglines in a maximum span of 1 minute and only for the current day. Data should be processed at least once; acceptable error: there may be missing loglines in a maximum span of 1 minute and only for the current day.
- Loosely coupled: the solution proposed must allow easily changing the destination storage to any of the storages available in the company.
- Interest of the user: the new characteristics of the process should have a strong positive impact on the daily work of the data analysts.

Table 4.1: Project requirements

Figure 4.1: Architecture proposed.

4.4 Speed layer

The choice of using a stream-processing framework is mainly due to the requirement of a maximum freshness of 2 minutes for the data stored in the sync view. In fact, the existing process, implemented using Hadoop MapReduce, has an execution time of approximately 30 minutes, and it cannot be improved enough to satisfy the data freshness requirement. In the architecture proposed the streaming framework is used for data injection (and not for aggregation). The initial idea of substituting MapReduce with Apache Spark would have lowered the latency; however, this would not have been sufficient to satisfy the requirements. In fact, supposing an implementation with Apache Spark, and considering that the messaging system stores the events unordered, the process would have to scan every time all the loglines stored in the messaging system and only then filter the ones related to the last minute. This would be inefficient, considering that the messaging system implemented in Trovit retains 7 days of data and that, for example, for the loglines of type click there are approximately 5 million events per day. An alternative would be to implement a solution that stores the offsets of the events already processed, but this would require complicated logic and would be inefficient and highly error-prone. For this reason it has been chosen to use a stream-processing framework, and in particular Apache Spark Streaming. The speed layer consists of micro-batches running every minute; each micro-batch reads the events directly from Kafka, performs the data de-duplication and the invalid click detection logic, and then stores the data in HDFS as Parquet files, with one folder created for each micro-batch (one minute). The destination storage of the speed layer will be named speed view. Moreover, each micro-batch creates a new partition in the external Hive table (a sketch of such a micro-batch is given below).
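The following is a minimal sketch of such a one-minute micro-batch pipeline using the Spark Streaming direct Kafka API (Spark 1.x style). The topic name, the parsing and de-duplication details, the number of partitions and the output paths are hypothetical placeholders, not Trovit's actual code.

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.sql.{SQLContext, SaveMode}
import org.apache.spark.streaming.{Minutes, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object ClickSpeedLayer {
  // Placeholder for the real logline model and parsing logic.
  case class Click(timestamp: Long, adId: Long, sourceId: Long, ip: String)
  def parseClick(raw: String): Click = Click(0L, 0L, 0L, raw.take(15))

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("speed-layer-click")
    val ssc  = new StreamingContext(conf, Minutes(1))       // one-minute micro-batches

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("click"))                        // hypothetical topic name

    stream.foreachRDD { (rdd, time) =>
      val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
      import sqlContext.implicits._

      val loglines = rdd
        .repartition(8)                                      // redistribute work across executors
        .map { case (_, raw) => parseClick(raw) }            // build the logline objects
        .distinct()                                          // de-duplication within the batch

      // One Parquet folder per micro-batch; a Hive partition pointing at it
      // would be added afterwards (e.g. via ALTER TABLE ... ADD PARTITION).
      loglines.toDF().write.mode(SaveMode.Overwrite)
        .parquet(s"hdfs:///speed_view/click/${time.milliseconds}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```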

4.5 Batch layer

The batch layer consists of the ETL process described in chapter 2, which is executed every hour. The output of the batch layer are Parquet files stored in the batch view, split by day into different HDFS folders. Each HDFS folder corresponds to one day and to one partition in the Hive external table.

4.6 Serving layer

The serving layer consists, first of all, of the batch view and the speed view. Both views contain data with the same structure, but the first is produced by the batch layer and its output is partitioned by day, while the second is produced by the speed layer and its output is partitioned by minute. A Hive external table has been created to join the two views; in this way users can query only one table and see the results of both the batch and the speed layer together. This unique view is called sync view, as shown in figure 4.2 (a sketch of such a table is given at the end of this section). Since both the batch and the speed layer write output to the same table, it is important that when the batch layer writes new data, it also overwrites the same data previously written by the speed layer. The reason for the overwriting is that the batch process performs some additional logic that is not implemented in the speed layer. This additional logic consists of data transformations that require retrieving extra data from other databases and writing the results to other destination storages in other formats; it is therefore time-consuming and cannot be implemented in the speed layer because it would not allow satisfying the data freshness requirement. Since the mentioned logic is not implemented in the speed layer, data written by the speed layer lack some information that is not needed by data analysts when analyzing recent data (the current day) but that is required for analyses that include older data (e.g. analyses of the last month's clicks). The synchronization process is motivated by the fact that data from the speed layer are used exclusively to enable near real-time analytics and must be overwritten with the data produced by the batch layer when available.

Figure 4.2: Sync view in the serving layer.

Moreover, the serving layer includes a process that reads the data of the current day from the sync view, performs some aggregations and dumps the new data into the data warehouse (as shown in figure 4.1). The sync view is queried by data analysts using Apache Hive and Cloudera Impala; the data warehouse is queried by data analysts and by the reporting tools used to display internal statistics to the employees. A process to perform the data aggregation and the dump into the relational database already exists. The current solution uses Apache Sqoop to move the data from the source to the destination. The bottleneck of this process is the use of Sqoop, which relies on the MapReduce paradigm. For this reason, the solution proposed includes a new process for the serving layer implemented using Apache Spark. Taking as example one materialized view, named source revenues, generated by aggregating the clicks of the current day, the execution with Sqoop does not assure a data freshness of 6 minutes. In fact the execution takes approximately 3 minutes (which means a total latency of 5 minutes, considering also the speed layer), but it can last longer if the resources are not allocated quickly. Executing the same view with Spark takes on average 1.4 minutes. This is due to the fact that all the queried data fit in memory (considering that the clicks of one day are around 4 million and are stored in less than 1 GB).
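As an illustration of how a single external Hive table can expose both views, the sketch below creates a partitioned Parquet table and adds partitions whose locations point either to a daily folder of the batch view or to a one-minute folder of the speed view. All table, column and path names are hypothetical, and the actual partition layout used at Trovit may differ; the snippet assumes an existing SparkContext named sc (as in spark-shell).

```scala
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)  // sc provided by spark-shell or the application

// One table ("sync_click") covering both views: batch partitions point to
// daily folders, speed partitions to per-minute folders.
hiveContext.sql(
  """CREATE EXTERNAL TABLE IF NOT EXISTS sync_click (
    |  click_timestamp BIGINT, ad_id BIGINT, source_id BIGINT, valid BOOLEAN)
    |PARTITIONED BY (s_datestamp STRING)
    |STORED AS PARQUET""".stripMargin)

// Partition produced by the batch layer (one day).
hiveContext.sql(
  """ALTER TABLE sync_click ADD IF NOT EXISTS
    |PARTITION (s_datestamp = '2016-07-30')
    |LOCATION 'hdfs:///batch_view/click/2016-07-30'""".stripMargin)

// Partition produced by one micro-batch of the speed layer (one minute).
hiveContext.sql(
  """ALTER TABLE sync_click ADD IF NOT EXISTS
    |PARTITION (s_datestamp = '2016-07-31 18:05')
    |LOCATION 'hdfs:///speed_view/click/2016-07-31_1805'""".stripMargin)
```

During synchronization, dropping the speed-layer partitions that the batch layer has superseded is then a matter of deleting the corresponding partitions and folders.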

Chapter 5

Implementation

The Master Thesis project includes the implementation of the following: the full speed layer; the synchronization between the batch layer and the speed layer; a new data flow for the serving layer. The development followed the Agile methodology; more details about it are available in Appendix B. The technologies involved in the implementation, which will be mentioned in the next sections, include Apache HDFS, MySQL, Apache Zookeeper, Apache Spark and Spark Streaming, Sentry and Graphite. The programming languages used are Java and Scala. All the data storages and the technologies used for processing were imposed by the company. Additional supporting technologies (like Zookeeper, Sentry and Graphite) were chosen from among the ones already available in the company.

5.1 Speed layer

5.1.1 Technology

At the moment the thesis was started there were two main technologies available to implement the streaming process: Apache Spark Streaming and Apache Flink Streaming. Both systems provide very high throughput compared to other processing systems, and guarantee that every record will be processed exactly once. The main difference between the two technologies is that Flink implements a true stream (a continuous-flow, operator-based model) while Spark approximates a stream using very small micro-batches. Considering the requirements given by Trovit, a true stream is not needed, because micro-batches of one minute are sufficient to satisfy them. In fact, while there are some use cases where real time may be a better solution (for example credit card fraud detection), in Trovit the results of the streaming application will be mainly used by humans to analyze data, and in this case the difference between a latency of 0.5 seconds and one of one minute is not relevant when compared with the human reaction time. Moreover, the company had already adopted Spark for other applications, and this was an additional incentive to use Spark Streaming, since Spark was already installed and configured and some people in Trovit already had experience with it. Additionally, Spark Streaming is slightly more mature than Flink and it is used by many companies in Barcelona; this allows an easier acquisition of knowledge and the support of the community of users. Moreover, it is really important for a company adopting a new technology to have clear support from consultants in the area and internationally, and this is the case for Spark. In fact, there are many consultancy companies that offer support for Spark in Barcelona (not as many for Flink), and at the same time Trovit can always rely on the support of big companies like Databricks, Cloudera and Hortonworks.

With respect to stateful transformations, both systems support them and implement the state as a distributed in-memory key/value store (however, for Spark this is true only from version 1.6.0). According to [10], when the size of the state increases Spark remains reliable and does not crash, while Flink may throw OutOfMemoryErrors and fail the computation because it cannot spill the state to disk (recent Flink versions, however, offer the possibility of using an out-of-core state backend based on RocksDB). When choosing which technology to use, it has been considered that the same technology should be used for both the speed layer and the serving layer. This implies a new requirement: ease of integration with the destination storages (in the project implemented, MySQL). Spark provides an easy way to connect to many destination storages. In particular, the dump of the processed view into MySQL with Spark resulted in one line of code, exploiting the Spark SQL module and its JDBC connector (see the sketch at the end of section 5.2.2). One of the requirements, though, was the possibility of flexibly changing the destination storage. Considering other storages currently used in Trovit, like Apache HDFS, ElasticSearch and Redis, we can assert that Spark has connectors for all these technologies, while Flink currently supports Apache HDFS and ElasticSearch but not Redis.

5.1.2 Design

The process consists of small micro-batches of one minute. In each micro-batch the following actions are performed:

- Events are read from a Kafka topic.
- The logline object is created. In this phase possible errors in the syntax of the received event are detected: loglines with errors are excluded from the process and a notification is sent by e-mail.
- Loglines that were already processed by the batch layer are excluded from the process (this may happen in case the streaming process had stopped and is recovering by reading from Kafka all the missing events, not only the last minute).
- Duplicate loglines are discarded (e.g. if a click is registered twice with the same timestamp).
- Invalid clicks are detected (this is executed only for the loglines of type click).
- Data are written to Apache HDFS in Parquet format.
- A partition is added to the Hive table (the name of the partition is the timestamp of the earliest logline received inside that micro-batch).

Figure 5.1: BPMN of the speed layer.

During the tests it appeared that sometimes data were not well distributed between the executors; this was mainly happening when Spark was reading more data from one Kafka partition or after recovering from a failure.

To solve this issue a data repartition has been introduced at the beginning of the flow, soon after reading the data from Kafka. The number of partitions is chosen dynamically according to the number of executors and cores set up in the configuration. Even if repartitioning is always an expensive operation, it has been experimentally seen that a repartition at the beginning of each micro-batch made the rest of the job much faster, resulting in an overall performance improvement. Moreover, this allows using more executors than the number of Kafka partitions, because without repartitioning any additional executor would stay idle.

5.1.3 Invalid clicks detection

Invalid click detection is performed only for the loglines of type click. The click-validation logic is the following: if the same click (the same user clicking on the same ad) is performed more than once in one minute, only the first click must be considered valid (boolean value set to true), while all the other clicks received in the same minute are marked as invalid. To perform this check, every micro-batch must know the last valid timestamp for each click (considering the combination of the user and the ad clicked). To implement this logic the stateful operation updateStateByKey has been used. This transformation allows maintaining an arbitrary state in memory while continuously updating it with new information [11]. In our case the state is a structure {click key, last valid timestamp} that is updated at every micro-batch. In this way every micro-batch is able to retrieve the last valid timestamp for the clicks processed, determine whether the new clicks received are valid, and update the structure with the new values. The click key is composed of information about the user (IP address, user agent, type of browser) and about the ad (page identifier, section identifier, ad identifier). In case a click key is not present in the current micro-batch (one minute), it is released from memory, as shown in figure 5.2 (a sketch of the update function is given at the end of this section).

Figure 5.2: Stateful transformation for invalid clicks detection.

The Spark transformation updateStateByKey keeps the state in memory. This can result in a scalability problem if the size of the state increases. However, in our case this transformation is used only for the loglines of type click, for which currently a maximum of 5000 events per minute are received (a size that easily fits in memory), and the number of elements in the state is at most the number of elements received in the last minute. This is because, as previously explained, a key that is not received in the current micro-batch is released from memory. It is important to notice that in order to use stateful transformations in Spark, checkpointing must be enabled. More details about the checkpointing implementation and related problems are available in section 5.4.
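A minimal sketch of the state update follows, assuming a hypothetical ClickKey and timestamps in milliseconds (the real key and validity marking at Trovit may differ). The state per key is the last valid timestamp, a key absent from the current micro-batch is dropped by returning None, and a click arriving less than one minute after the last valid one does not advance the stored timestamp.

```scala
import org.apache.spark.streaming.dstream.DStream

// Hypothetical click key, mirroring the fields described above.
case class ClickKey(ip: String, userAgent: String, browser: String,
                    pageId: Long, sectionId: Long, adId: Long)

// State update: timestamps seen for this key in the current micro-batch
// -> new "last valid timestamp" state (None releases the key from memory).
def updateLastValidTs(batchTs: Seq[Long], state: Option[Long]): Option[Long] = {
  if (batchTs.isEmpty) None                       // key not seen this minute: drop it
  else {
    val earliest = batchTs.min
    state match {
      case Some(last) if earliest - last < 60000L => Some(last)  // still within the invalid window
      case _                                      => Some(earliest)
    }
  }
}

// Usage inside the streaming job (requires ssc.checkpoint("hdfs:///...")):
// clicks pairs each click key with the click timestamp.
def lastValidTimestamps(clicks: DStream[(ClickKey, Long)]): DStream[(ClickKey, Long)] =
  clicks.updateStateByKey(updateLastValidTs _)
```

Marking individual clicks as valid or invalid can then be done within the same micro-batch by comparing each incoming click against this state.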

5.2 Serving layer

The serving layer designed exploits some elements available in the previously implemented version, namely:

- a hard-coded configuration detailing the set of materialized views to dump into the data warehouse. The configuration includes the definition of the source and destination, but also the frequency at which the view must be executed (which implies the freshness required for that specific view);
- a database maintaining the history of the executed views, including the start and end time of each view and the total execution time.

The process implemented consists of two parts. The first is the synchronization part, which is scheduled after each new batch view is generated, and the second is the dump of the materialized views into the data warehouse.

5.2.1 Synchronization

If the sync view had been implemented in a relational database, the synchronization would simply have been an update of the data previously written by the speed layer with the new data from the batch layer. In HDFS the update of a single element is more difficult; for this reason it has been decided that all the data written by one micro-batch correspond to one Parquet file and one Hive table partition. In this way, during the synchronization phase, the batch process calculates the maximum timestamp of the loglines it is processing and deletes all the partitions and files generated by the speed layer with a lower timestamp. Since the batch layer cannot delete single loglines but only entire one-minute partitions, this process may generate errors within a one-minute span (either duplicate loglines or missing loglines for at most one minute). However, this error has been considered not relevant for the daily queries performed, and in any case it appears only in the current day, because data from the previous days are completely overwritten by the batch layer.

5.2.2 Dump of the materialized views in the data warehouse

This process is an infinite loop that, every four minutes, executes the following steps (as described in figure 5.3):

1. The list of the materialized views to execute and their configuration is read.
2. The last execution of the view is retrieved and compared with the view configuration to decide whether to execute it. If it is not required, the process continues to the following view in the list (loop task in figure 5.3).
3. Data are retrieved from the sync view through a SQL query that includes the data aggregation (performed using Spark SQL).
4. The retrieved data are dumped into a temporary table in the data warehouse.
5. Data are read from the temporary table and used to update the destination table in the data warehouse.
6. The database containing the last execution of each view is updated.
7. The execution continues with the next view.

The reason for the intermediate step of storing data in a temporary table in the data warehouse (step 4 above) is that Spark SQL does not support the UPDATE operation on relational databases. A sketch of steps 3 and 4 is shown below.
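The following is a minimal sketch of steps 3 and 4 for the source revenues view, using the Spark 1.x DataFrame JDBC writer. The SQL, table names, connection URL and credentials are illustrative placeholders, not the actual Trovit configuration, and hiveContext is assumed to be created as in the earlier sketches.

```scala
import java.util.Properties
import org.apache.spark.sql.SaveMode

// Step 3: aggregate the current day's clicks from the sync view (Spark SQL).
// Column and table names are hypothetical.
val sourceRevenues = hiveContext.sql(
  """SELECT source_id, country, COUNT(*) AS clicks, SUM(revenue) AS revenue
    |FROM sync_click
    |WHERE s_datestamp >= '2016-07-31'
    |GROUP BY source_id, country""".stripMargin)

// Step 4: dump into a temporary table in the data warehouse. Spark SQL cannot
// issue UPDATEs, so step 5 merges this temporary table into the destination
// table with plain SQL executed on MySQL.
val props = new Properties()
props.setProperty("user", "dwh_user")
props.setProperty("password", "secret")

sourceRevenues.write
  .mode(SaveMode.Overwrite)
  .jdbc("jdbc:mysql://dwh-host:3306/statistics", "tmp_source_revenues", props)
```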

Figure 5.3: BPMN of the serving layer.

5.3 Exception handling

In both the implementation of the speed layer and that of the serving layer, the exceptions have been categorized into two types:

- Non-blocking exceptions: minor issues that do not require stopping the process and that become relevant only if they are repeated.
- Blocking exceptions: major errors that require stopping the application and immediately notifying the developers/administrators.

5.3.1 Non-blocking exceptions

In the speed layer these are mainly the errors that happen when parsing a single log received from Kafka. In the serving layer this can be an error that occurred in only one of the scheduled views and does not affect the whole process (for example when Spark is querying data from the sync view while the synchronization process is deleting them). For non-blocking exceptions the notifications are managed through Sentry [12], a crash reporting tool that allows customizing the rate of e-mails sent to the developers/administrators in case of error. Non-blocking exceptions are notified to the user by e-mail the first time they are seen and any time they are seen again after a specific threshold (in our case 2 hours for the errors occurring in the speed layer and 1 hour for the errors in the serving layer).

5.3.2 Blocking exceptions

These are all the errors occurring during the initialization of the job, including all the connections to the databases, or fatal errors. An example is the error generated when the streaming job cannot connect to Apache Zookeeper. In fact, the streaming application must read from Zookeeper the timestamp of the last data written by the batch process to ensure that data are not duplicated; if this value cannot be read, the application has to stop. When blocking exceptions happen, an e-mail is immediately sent to the developers/administrators.

5.4 Fault-tolerance

The system designed must be able to operate 24/7, to continue operating properly in the event of a failure, and to auto-restart if needed. For this purpose Spark Streaming provides a feature that allows checkpointing information about the running process on a fault-tolerant storage system. The speed layer implemented checkpoints the metadata of the process on Apache HDFS. This allows the system to recover in case of failure and to remember the offsets of the last data read from Kafka. In the architecture implemented, if the streaming job fails it will try to restart after a few minutes. In case the time needed to recover is longer (for example if a database is not accessible for one hour), two main points must be taken into consideration:

- if in the meanwhile the batch process was executed, data already written by the batch process don't have to be processed;
- the amount of data to process can be much bigger than the amount usually processed in one single micro-batch.

To overcome the first problem, at the beginning of each micro-batch the timestamp of the most recent logline processed by the batch is retrieved from Apache Zookeeper, and loglines with a lower timestamp are discarded. With respect to the second point, a maximum rate of events per second is set in the configuration of the streaming process. The rate is different for each type of logline. When setting the rate per second, a value that is too low may lead to a long recovery delay, while a value that is too big may lead to the failure of the program if the allocated memory is not enough to deal with that amount of events. For this reason some experiments have been made to choose a configuration that assures the recovery of the application (a configuration sketch is given at the end of this chapter). Checkpointing proved to be a great solution to assure the fault tolerance of the application. However, it must be considered that this functionality is quite recent and still has some flaws; many times during the programming phase workarounds had to be used to overcome problems, especially because checkpointing is based on serialization and not all the Java objects used are serializable or can be made so.

5.5 Monitoring and guarantee of satisfaction of the requirements

Graphite [13], a scalable real-time graphing system, has been used to monitor the streaming application. The metrics sent to Graphite include the number of loglines received from Kafka, the number of loglines correctly parsed and the number of loglines where errors are encountered. Furthermore, to assure that the requirements are satisfied at all times, a separate project has been created to check both the speed layer and the serving layer processes. In particular, this project checks:

- that data are continuously read from Kafka (checking whether a new metric has been created in Graphite in the last minute);
- that streaming partitions are created in Hive every minute;
- that the freshness of the data of the materialized views created in the data warehouse respects the requirements (e.g. that the view source revenues has been executed at least once in the last 6 minutes).
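As an illustration of the recovery setup described in section 5.4, the sketch below uses StreamingContext.getOrCreate together with the spark.streaming.kafka.maxRatePerPartition setting; the checkpoint directory, rate value, topic name and pipeline contents are hypothetical.

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object RecoverableSpeedLayer {
  // Illustrative checkpoint location on HDFS.
  val checkpointDir = "hdfs:///checkpoints/speed-layer-click"

  def createContext(): StreamingContext = {
    val conf = new SparkConf()
      .setAppName("speed-layer-click")
      // Cap the number of events read per Kafka partition per second so that
      // a recovering job does not pull more data than its memory can handle.
      .set("spark.streaming.kafka.maxRatePerPartition", "2000")

    val ssc = new StreamingContext(conf, Minutes(1))
    ssc.checkpoint(checkpointDir)                            // metadata checkpointing on HDFS

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, Map("metadata.broker.list" -> "broker1:9092"), Set("click"))
    stream.count().print()                                   // placeholder for the real per-micro-batch logic
    ssc
  }

  def main(args: Array[String]): Unit = {
    // On a restart the context (including the Kafka offsets) is rebuilt from
    // the checkpoint; otherwise it is created from scratch.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```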

Chapter 6

Experiments and discussion

Experiments have been executed in order to find the optimal configuration of executors, cores and memory for both the Spark Streaming job that implements the speed layer and the Spark job that implements the serving layer. All the experiments have been run on a cluster of 48 machines, where other Spark jobs and MapReduce jobs were running. The resources are allocated through the cluster manager, YARN, and the jobs are always run in cluster mode. The experiments have been run with different numbers of executors; however, this number is always set explicitly in the configuration. In fact, it has been chosen not to use dynamic allocation (auto-scaling), since for the streaming job the rate of events read from Kafka is quite stable and finding the optimal configuration manually allows saving many resources. With respect to the serving layer, not using dynamic allocation makes it possible to guarantee that only the minimum amount of resources needed to complete the job is used (a sketch of such a fixed configuration is shown below).
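A minimal sketch of how such a fixed resource configuration can be expressed when building the SparkConf follows; the specific values are illustrative placeholders, not the configurations selected in the experiments.

```scala
import org.apache.spark.SparkConf

// Fixed allocation on YARN: dynamic allocation disabled, executors, cores
// and memory set explicitly. Values shown are placeholders.
val conf = new SparkConf()
  .setAppName("speed-layer-click")
  .set("spark.dynamicAllocation.enabled", "false")
  .set("spark.executor.instances", "2")     // number of executors
  .set("spark.executor.cores", "2")         // cores per executor
  .set("spark.executor.memory", "512m")     // memory per executor
  .set("spark.driver.memory", "512m")       // driver memory
```

In practice these values are typically passed to spark-submit rather than set in code, but the configuration keys are the same.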

6.1 Speed layer

Figure 6.1: Speed layer: experiments with ad impression loglines.
Figure 6.2: Ad impressions - Adding cores.
Figure 6.3: Ad impressions - Adding executors.

For the speed layer, experiments have been made with two types of loglines: ad impressions (figures 6.1, 6.2 and 6.3) and clicks (figures 6.4, 6.5 and 6.6). The rates of events received in the two cases are largely different: clicks arrive at around 3500 events per minute (considering the daily average), while ad impressions arrive at a much higher rate. In all the experiments the driver memory has been fixed to 512 MB, because it has been empirically seen that this is the minimum amount of memory needed to run the job and that increasing this value does not affect the performance. When performing the experiments, it is important to consider that the optimal configuration must allow the recovery of the job through checkpointing. The tests executed considered a recovery time of 15 minutes; however, normally when the job fails it restarts in less than five minutes. In the experiments, the recovery is considered successful if the job starts successfully and is able to process all the data generated in the 15 minutes in which the streaming process was not working, together with the new data arriving in the first 10 minutes of the job. This means that a delay in the processing of the newly arrived data is allowed only in the first 10 minutes after the recovery. In the tables of figures 6.1 and 6.4 the configurations highlighted in red are the ones that cannot be accepted, because they don't satisfy the two-minute latency requirement or because they cannot handle recovery. The configuration in green is the one that has been chosen as optimal.

Figure 6.4: Speed layer: experiments with click loglines.

Figure 6.5: Clicks - Adding executors.
Figure 6.6: Clicks - Adding memory.

The chart in figure 6.2 shows the performance of the job when adding cores. While the switch from 2 to 4 cores produced a big difference in the performance of the job (in the experiment that uses only two cores the recovery failed), adding further cores does not seem to improve the performance significantly. Similar conclusions can be drawn when experimenting with memory. The chart in figure 6.6 shows that after increasing the memory of the executors up to the optimal configuration, the performance increase is no longer relevant (or too small to justify the use of more resources). The charts in figures 6.3 and 6.5 describe the scalability, and in particular it is possible to detect a point at which using more resources (adding executors) does not pay off, because the decrease of the average time is really low. The threshold after which adding executors does not bring any benefit is observed at 4 executors for impressions (figure 6.3) and 2 executors for clicks (figure 6.5). It is also important to notice that launching more than two executors on a Kafka topic that has only two partitions is normally not recommended, because two executors would read from Kafka while the other executors would stay idle; however, in the speed layer implemented the data are repartitioned after reading, and this allows exploiting the additional executors.

6.2 Serving layer

One of the requirements of the project was to produce the materialized view source revenues with a maximum latency of 6 minutes. As shown in the charts of figures 6.8, 6.9 and 6.10, conclusions similar to those for the speed layer can be drawn. The charts show the increase in performance when adding memory or cores, and a point where the increase is so small that it does not justify the use of the resources. In the chart of figure 6.10 we can notice that increasing the number of executors to 8 even decreases the performance, and this is probably due to the time needed to repartition a small amount of data across 8 nodes. The configuration highlighted in green in figure 6.7 is the one that has been chosen as optimal. When analyzing the performance of the serving layer it is important to notice that one of the bottlenecks of the application is the dump of the data into a relational database, due to the fact that parallel writes are not possible. However, the choice of the data storage for the data warehouse goes beyond the scope of this project. An additional requirement for the serving layer was the capability of handling other materialized views requested by the users. Experiments about this are available in appendix C.

Figure 6.7: Experiments with view source revenues.
Figure 6.8: View source revenues - adding cores.
Figure 6.9: View source revenues - adding memory.
Figure 6.10: View source revenues - adding executors.

6.3 Experiment results

Table 6.1 shows the results of the experiments performed and the degree to which they satisfy the metrics defined for the requirements.

Non-functional requirements:

- Data freshness in sync view. Metric: < 2 min. Experiment: streaming job running for two weeks. Result: data freshness < 2 min for 99.5% of the data. Assuming a uniform distribution of data along the day, only in 0.5% of the cases were the data processed with a delay of around 5-10 minutes, due to a failure of the process (twice in a week) and to the time needed for the automatic restart.
- Data freshness for view source revenues. Metric: < 6 min. Experiment: Spark job running for two weeks. Result: data freshness < 6 min for 98.5% of the data.
- Fault tolerance (robustness). Metrics: maximum number of times the system fails (blocking errors): twice a week; minimum fraction of failures from which the system recovers automatically: 99%; maximum number of non-blocking errors: one per hour. Experiments: streaming job and Spark job running for two weeks. Results: streaming job failures = 4, Spark job failures = 0, total failures = 4 (= maximum acceptable failures); total recoveries = 4 = 100% (the system recovered correctly after all the failures during the tests). Over one week of execution, streaming job non-blocking errors = 80, Spark job non-blocking errors = 3, total non-blocking errors = 0.5 per hour.
- Horizontal scalability. Metric: the cluster must be able to run the application even if the number of logs became 10 times bigger. Experiment: streaming job and Spark job running for the logline click. Result: 2% of the cluster used; it is possible to deal with a quantity of clicks 10 times bigger by assigning more resources to the jobs.
- Modularity. Metric: the solution proposed must be able to handle the existing types of loglines, and possible new ones, just by adding one Java class. Experiment: created a new logline. Result: three Java classes added for the streaming execution (one to define the data model and two classes to connect with the data source and destination).

- Accuracy. Metrics: there may be duplicate loglines in a maximum span of 1 minute and only for the current day; there may be missing loglines in a maximum span of 1 minute and only for the current day. Experiment: streaming job running for two weeks. Results: the only case in which there may be duplicates is when, during one micro-batch, one executor fails and, when restarted, re-executes data that had already been inserted in the speed view. This can happen only within the range of data of one micro-batch (1 minute), and the probability that two executors fail in the same day is really low (it never happened during the experiments). Moreover, duplicates may be found only in the data of the current day, because data from the previous days are always overwritten by the batch process. The only case in which loglines may be missing is if they had been processed by a micro-batch but their partition was deleted during the synchronization phase and the data were not overwritten by the batch; this can happen only in the range of 1 minute. Example: given the micro-batch partitions Partition1 (18:05:00-18:05:59), Partition2 (18:06:00-18:06:59) and Partition3 (18:07:00-18:07:59), if the batch layer starts processing data from 18:06:15 it deletes Partition1 and Partition2, so the loglines between 18:06:16 and 18:06:59 will be missing until the execution of a new batch process.
- Loosely coupled. Metric: the solution proposed must allow easily changing the destination storage (data warehouse) to any of the storages available in the company. Experiment: analyzed further storages used in Trovit (Apache HDFS, ElasticSearch, Redis). Result: connectors are available that can easily be used to change the destination storage: Spark SQL for Apache HDFS, the Spark-Redis package [14], and the Elasticsearch-Hadoop connector.
- Interest of the user. Metric: the new characteristics of the process should have a strong positive impact on the daily work of the data analysts. Result: a majority of the users said that the project will have an impact on their job; among them, 25% said that it will improve their job a lot. Details about the survey are in the next section.

Table 6.1: Experiment results

Interest of the user

An online survey was prepared and submitted to 25 employees at Trovit. The survey was strictly focused on the impact of the project on the job of the employees; it therefore excluded the general impact on Trovit's business through the improvement of some of the automatic decision-making processes, because this cannot be demonstrated before the project has been running in the production environment for some months. Among the employees targeted, half belong to the sales department and were interviewed about the impact of a smaller data latency in the data warehouse; the other half were business analysts, product managers and developers, who were interviewed about the impact of a smaller data latency in the sync view. Both surveys consisted of three questions related to:

how frequently the employee was accessing the data discussed;
whether he/she would access them more often in case of a smaller data latency;
the rank of the impact that a smaller data latency would have on their daily work.

The survey was answered by 17 people in total: 8 people within the sales group and 9 people in the second group. In the first group, 75% of the employees said that the project would improve their job, and among them 25% said that it would improve it a lot. The answers of the second group gave similar results (71% and 28% respectively). The questions used for the survey can be seen in Appendix D.

Comparison with current solution

Tables 6.2 and 6.3 compare the performance of the solution designed and developed in this work with the current solution used in the company. The purpose of this comparison is not to replace the old solution with the newly implemented one, but simply to analyze the performance difference between the two technologies used: Hadoop MapReduce and Spark. For this reason, when calculating the execution time of the process implemented in MapReduce, only the phases that are also implemented in Spark have been considered. The comparison involves the latency and the resources used (memory and cores) in a unit of time. For the new solution, the number of cores and the memory are calculated by summing up the resources used by every executor. For the current solution, implemented with MapReduce, calculating the resources used is a bit more complicated because the solution is split into different phases and, for each phase, the number of mappers and reducers changes along the day according to the amount of data to process. Therefore the values displayed in the tables are a weighted average that considers the variation of the number of mappers and reducers used as well as the time needed to execute each phase (a small sketch of this calculation is given after Table 6.3). When analyzing the resources consumed, it is important to notice that both the speed layer and the serving layer implemented in the new solution are continuous flows that allocate resources the first time they are started and never release them. For the existing solution, instead, resources are allocated when the process is started and released at the end; however, the gap between the end of one process and the beginning of the next one is so small that the resources can be considered permanently allocated.

Loglines injection

The benchmarking of the current solution and the newly implemented solution is shown in Table 6.2. Experiments were performed using the loglines of type click.

                    Current solution    Speed layer of the new solution
Latency             27 minutes          25 seconds (guaranteed less than 1 minute)
Number of cores     4*                  4
Memory (MB)         2400*               1536 (512 each for 2 executors and the driver)

* approximately, considering the weighted average of the resources used in the different phases

Table 6.2: Comparison of the logline injection

Serving layer

For the serving layer, experiments were performed using the loglines of type click and the materialized view source revenues. The results of the benchmarking are shown in Table 6.3. The experiments compare the performance of dumping a materialized view into the data warehouse, and exclude the synchronization phase of the new solution.

                    Current solution    Serving layer of the new solution
Latency             3 minutes           85 seconds
Number of cores     7*                  8
Memory (MB)         4000*               1536 (512 each for 2 executors and the driver)

* approximately, considering the weighted average of the resources used in the different phases

Table 6.3: Comparison of the serving layer
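The asterisked MapReduce figures in Tables 6.2 and 6.3 are the weighted averages described in the previous section. The following minimal sketch shows the calculation only: the phase durations and resource values are invented for illustration and do not correspond to the real jobs.

public class WeightedResources {

    // Weighted average of a per-phase resource (cores or memory),
    // where each phase is weighted by the time it runs.
    static double weightedAverage(double[] minutes, double[] resourcePerPhase) {
        double total = 0, weighted = 0;
        for (int i = 0; i < minutes.length; i++) {
            total += minutes[i];
            weighted += minutes[i] * resourcePerPhase[i];
        }
        return weighted / total;
    }

    public static void main(String[] args) {
        double[] minutes = {10, 12, 5};          // duration of each MapReduce phase (hypothetical)
        double[] cores   = {6, 8, 2};            // cores used by each phase (hypothetical)
        double[] memory  = {3000, 5000, 1500};   // MB used by each phase (hypothetical)
        System.out.printf("avg cores = %.1f, avg memory = %.0f MB%n",
                weightedAverage(minutes, cores),
                weightedAverage(minutes, memory));
    }
}

Weighting by duration ensures that a short phase with many mappers does not dominate the average.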

Chapter 7 Conclusion and future work

In recent years companies have focused their efforts on near real-time analytics, understanding the importance of reacting on time to possible business changes. The project demonstrated how the lambda architecture can be a good approach for a company to enable near real-time analytics, integrating the existing batch processes with a streaming solution. The experiments showed that all the requirements have been satisfied and that Apache Spark Streaming proved to be a good solution for the Trovit use case. Moreover, some experiments clearly showed the better performance of jobs run with Apache Spark and Spark Streaming compared to Hadoop MapReduce, using the same amount of resources (and in some cases even less). This was due to the fact that, in all the use cases, the amount of data involved fitted in memory.

However, during the implementation it became clear how the development of a fairly simple ETL flow requires a lot of work because of the immaturity of the technologies. The same ETL flow could probably have been drawn very quickly in one of the graphical ETL tools used in traditional Business Intelligence, but currently dealing with tools like Spark, despite their immaturity, is the only way to obtain good performance. Although Apache Spark is a recent technology, it must be acknowledged that its integrations with other tools and databases (Hadoop, Hive, MySQL, Kafka) are quite well supported, and they turned out to be an easy step during the implementation. Furthermore, when analyzing the differences between the newly implemented solution (both speed layer and serving layer) and the previously existing one (which uses MapReduce and Sqoop), the first evidence is that the code written in Spark is significantly shorter and more readable.

Besides these considerations about the technologies involved, it is important to remember that the aim of the project was not to find an alternative to the data flow implemented in MapReduce, but to design and implement a solution that, while keeping the existing flow, would increase the freshness of the data and overcome all the limitations described in section 2.3.

Finding the optimal configuration for each speed-layer job running for each type of logline, as well as for each view that has to be executed in the serving layer, required a considerable amount of work. In the future, a new logic should be created to automatically configure the jobs (number of executors, memory and cores assigned, as well as number of partitions) according to the input size of the job. This would be easier for the speed layer, where the rate of incoming events is quite stable, and more complicated for the serving layer, where the size of the data read has to be estimated. Moreover, since the final destination of the serving layer is a relational database, the configuration will also have to take it into account, for example in order not to overload the database with too many connections.
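The following is a minimal sketch of what such an automatic configuration logic could look like; it is not part of the thesis implementation. The heuristic, the thresholds and the class name are hypothetical, while the configuration keys (spark.executor.instances, spark.executor.cores, spark.executor.memory, spark.default.parallelism) are standard Spark properties.

import org.apache.spark.SparkConf;

public class JobAutoConfig {

    // Derives a Spark configuration from an estimated input size in megabytes.
    // The one-executor-per-2-GB rule and the bounds are invented for the example.
    public static SparkConf forInputSize(long inputSizeMb) {
        int executors = Math.max(2, Math.min(20, (int) (inputSizeMb / 2048)));
        int coresPerExecutor = 2;
        return new SparkConf()
                .set("spark.executor.instances", String.valueOf(executors))
                .set("spark.executor.cores", String.valueOf(coresPerExecutor))
                .set("spark.executor.memory", "512m")
                // roughly two tasks per core actually granted to the job
                .set("spark.default.parallelism",
                        String.valueOf(executors * coresPerExecutor * 2));
    }
}

A real version of this logic would also have to consider the type of calculation performed and, for the serving layer, the load it puts on the relational database, as discussed above.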

One of the evident problems of the proposed solution is the maintenance of the code in the batch layer and in the speed layer. Even though an effort has been made to share code between the two layers, their implementations still depend on their respective technologies (MapReduce for the batch layer and Spark Streaming for the speed layer), and any future change in the logic has to be reflected in both layers.

For the two reasons just mentioned, future research should focus on how to automatically configure the jobs according to their input and to the type of calculations performed, but also on how to abstract the implemented flow so that any change is applied only once (instead of maintaining the two layers separately). If this abstraction of the flow, which would then be independent of the technology, becomes possible, it will also enable choosing the technology according to some parameters. This idea is currently under research, and it is based on the awareness that each technology may be the best choice in some specific case, according to the input of the job and to the calculations that need to be performed. In [9] we see an attempt to create a framework that may do exactly this: given a definition of the input and a high-level data flow definition (not dependent on the implementation), it tries to choose at run time the technology where to run the job and the optimal configuration, and to transform the data flow into executable code.

Appendices

Appendix A Logline

The following is the Avro schema used to write the Parquet files for the logline click:

{
  "type": "record",
  "name": "AvroInternalStatsClick",
  "namespace": "com.trovit.internalstats.model.avro",
  "fields": [
    { "name": "s_unique_id", "type": ["string", "null"] },
    { "name": "dt_date", "type": "string" },
    { "name": "fk_c_id_tbl_countries", "type": "string" },
    { "name": "fk_i_id_tbl_vertical", "type": "int" },
    { "name": "i_testab_id", "type": ["long", "null"] },
    { "name": "s_testab_option", "type": ["string", "null"] },
    { "name": "i_origin", "type": "int" },
    { "name": "i_browser", "type": "int" },
    { "name": "i_section", "type": "int" },
    { "name": "i_section_type", "type": "int" },
    { "name": "b_is_premium_section", "type": "boolean" },
    { "name": "i_position", "type": "int" },
    { "name": "i_page", "type": "int" },
    { "name": "i_source_click_price", "type": "int" },
    { "name": "f_source_click_price_euro", "type": "double" },
    { "name": "s_ip", "type": "string" },
    { "name": "s_user_agent", "type": "string" },
    { "name": "s_what", "type": "string" },
    { "name": "s_where", "type": ["string", "null"] },
    { "name": "fk_i_id_tbl_types", "type": ["int", "null"] },
    { "name": "fk_i_id_tbl_campaigns", "type": ["int", "null"] },
    { "name": "c_id_ad", "type": "string" },
    { "name": "fk_i_id_tbl_sources", "type": "int" },
    { "name": "fk_i_id_tbl_regions", "type": ["int", "null"] },
    { "name": "fk_i_id_tbl_cities", "type": ["int", "null"] },
    { "name": "fk_i_id_tbl_city_areas", "type": ["int", "null"] },
    { "name": "fk_i_id_tbl_postcodes", "type": ["int", "null"] },
    { "name": "s_region", "type": ["string", "null"] },
    { "name": "s_city", "type": ["string", "null"] },
    { "name": "s_city_area", "type": ["string", "null"] },
    { "name": "s_postcode", "type": ["string", "null"] },
    { "name": "i_num_pictures", "type": ["int", "null"] },
    { "name": "b_nrt", "type": "boolean" },
    { "name": "b_is_publish_your_ad", "type": ["boolean", "null"] },
    { "name": "b_out_of_budget", "type": ["boolean", "null"] },
    { "name": "i_suggester", "type": ["int", "null"] },
    { "name": "s_agency", "type": ["string", "null"] },
    { "name": "s_make", "type": ["string", "null"] },
    { "name": "fk_i_id_tbl_makes", "type": ["int", "null"] },
    { "name": "s_model", "type": ["string", "null"] },
    { "name": "fk_i_id_tbl_models", "type": ["int", "null"] },
    { "name": "s_car_dealer", "type": ["string", "null"] },
    { "name": "s_company", "type": ["string", "null"] },
    { "name": "fk_i_id_tbl_companies", "type": ["int", "null"] },
    { "name": "s_category", "type": ["string", "null"] },
    { "name": "fk_i_id_tbl_categories", "type": ["int", "null"] },
    { "name": "b_is_new", "type": ["boolean", "null"] },
    { "name": "s_v", "type": ["string", "null"] },
    { "name": "s_pageview_id", "type": ["string", "null"] },
    { "name": "fk_i_id_tbl_dealer_type", "type": ["int", "null"] },
    { "name": "fk_i_id_tbl_pricing_type", "type": ["int", "null"] },
    { "name": "s_cookie_id", "type": ["string", "null"] },
    { "name": "s_google_id", "type": ["string", "null"] },
    { "name": "fk_i_id_tbl_users", "type": ["int", "null"] },
    { "name": "b_valid", "type": ["boolean", "null"] },
    { "name": "b_out_of_budget_original", "type": ["boolean", "null"] },
    { "name": "i_campaign_type", "type": ["int", "null"] },
    { "name": "i_click_type", "type": ["int", "null"] }
  ]
}
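As a usage illustration only (this is not code from the thesis), the following sketch reads click loglines stored with the schema above as Parquet files and counts the valid ones; the input path is hypothetical and the SparkSession API assumes Spark 2.x.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ReadClickLoglines {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("read-click-loglines")
                .getOrCreate();

        // Parquet files carry their schema, so the fields listed above
        // (s_unique_id, b_valid, ...) are directly available as columns.
        Dataset<Row> clicks = spark.read().parquet("/data/loglines/click/"); // hypothetical path
        long validClicks = clicks.filter("b_valid = true").count();
        System.out.println("valid clicks: " + validClicks);

        spark.stop();
    }
}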

Appendix B Agile Methodology

The work for the Master Thesis was carried out in collaboration with a team of five people working at Trovit. The project followed the Agile development methodology; in particular, the following practices were adopted:

daily meetings (around 10 minutes) where each team member explains what he has done and what he plans to do during the day;
weekly sprints where each team member presents the tasks he has completed and new tasks are assigned;
a quarterly retrospective meeting to discuss how the team is doing and the points to improve.

The Master Thesis project was initially divided into macro-tasks and then split into smaller tasks that were added to the already existing product backlog of the team. Figure B.1 shows the schedule of the sprints in which the project was involved. Figure B.2 is an excerpt of the team product backlog containing only the tasks related to the Master Thesis project.

Figure B.1: Weekly sprints.

Figure B.2: Product backlog.

Appendix C Additional views for the data warehouse

An additional requirement for the serving layer was the capability of handling other materialized views requested by the users. Experiments have been made to find an optimal configuration for two additional materialized views. Compared to the source revenues view previously analyzed, these views have to deal with a larger amount of data, and for this reason they have different requirements. The requirements that have been set are:

the two views must run in the same application (thus finding a common configuration);
maximum data freshness for the view hit zone monthly = 12 minutes;
maximum data freshness for the view source revenues hourly = 7 minutes.

Highlighted in green in Figures C.1 and C.2 is the configuration that satisfies all the requirements for the two views.

Figure C.1: Experiments view hit zone monthly

Figure C.2: Experiments view source revenues hourly
