
UNIVERSIDAD POLITÉCNICA DE MADRID
ESCUELA TÉCNICA SUPERIOR DE INGENIEROS DE TELECOMUNICACIÓN

MODULARIZING FLINK PROGRAMS TO ENABLE STREAM ANALYTICS IN IOT MASHUP TOOLS

FEDERICO ALONSO FERNÁNDEZ MORENO

2018

DEPARTMENT OF INFORMATICS
INFORMATICS 4 - CHAIR OF SOFTWARE AND SYSTEMS ENGINEERING
TECHNICAL UNIVERSITY OF MUNICH

Master's Thesis in Informatics

Modularizing Flink Programs to Enable Stream Analytics in IoT Mashup Tools

Federico Alonso Fernández Moreno


DEPARTMENT OF INFORMATICS
INFORMATICS 4 - CHAIR OF SOFTWARE AND SYSTEMS ENGINEERING
TECHNICAL UNIVERSITY OF MUNICH

Master's Thesis in Informatics

Modularizing Flink Programs to Enable Stream Analytics in IoT Mashup Tools

Modularisierung von Flink Programmen zur Ermöglichung von Stream Analysen in IoT Mashup Tools

Author: Federico Alonso Fernández Moreno
Supervisor: Prof. (Chang'an Univ.) PD Dr. habil. Christian Prehofer
Advisor: M.Sc. Tanmaya Mahapatra, Dr. Ilias Gerostathopoulos
Submission Date: 16th July 2018

Ich versichere, dass ich diese Masterarbeit selbstständig verfasst und nur die angegebenen Quellen und Hilfsmittel verwendet habe.

I confirm that this Master's Thesis in Informatics is my own work and I have documented all sources and material used.

Munich, 16th July 2018

Federico Alonso Fernández Moreno

Acknowledgments

I would like to thank in the first place Prof. (Chang'an Univ.) PD Dr. habil. Christian Prehofer, who gave me the opportunity to undertake this project. I am also grateful to Dr. Ilias Gerostathopoulos, for always being there to help me when I needed it. And, of course, I am thankful to Tanmaya Mahapatra, for his advice and for trusting me during all these months.

I would also like to thank my home university, the Technical University of Madrid (UPM), especially the staff of the International Office of ETSIT, because exchanges do make for better engineers and better people.

My family has always supported me, regardless of the circumstances. I can only say thank you, because I have always felt their greatest support. I would simply not be here without them.

To Alba, for sharing this path with me. And, of course, to all my friends who have been like family these months, those I already knew and those I have met during this year. They know what it means to me to complete this project. Thank you.


Abstract

Among all the challenges that the Internet of Things (IoT) poses, the analysis of large amounts of information in real time is probably one of the most considerable. To this end, novel Big Data approaches foster Stream Analytics as a solution to the strict latency and throughput requirements that IoT platforms impose. Stream Analytics should therefore be an essential component of IoT applications.

However, developing IoT applications is no easy task. To simplify the process of addressing heterogeneous devices at once, mashup tools are widely used. They provide a lightweight, user-friendly way of prototyping, but they currently lack integration with Stream Analytics platforms.

The main goal of this thesis is to integrate Apache Flink, an open-source platform for distributed Stream Analytics, into aflux, an IoT mashup tool developed at the Chair of Software and Systems Engineering of the Technical University of Munich. To this end, a conceptual approach to modularize Flink programs in a generic, flexible and extensible way has been designed, implemented and evaluated. With this approach, end users can not only program Flink graphically, but also get guidance on how to do it correctly while creating these programs.

Keywords: Stream Analytics; IoT Mashup Tools; Big Data Analytics; IoT


Zusammenfassung

Das Internet der Dinge (englisch Internet of Things, IoT) wirft verschiedene Herausforderungen auf, unter denen die Echtzeitanalyse großer Datenmengen wahrscheinlich eine der wichtigsten ist. Zu diesem Zweck unterstützen neuartige Big-Data-Ansätze Stream Analytics als Lösung für die strengen Latenzzeit- und Durchsatzanforderungen, die IoT-Plattformen auferlegen. Daher sollte Stream Analytics eine wesentliche Komponente in IoT-Anwendungsprogrammen darstellen.

Trotzdem ist die Entwicklung von IoT-Anwendungsprogrammen keine leichte Aufgabe. Um die Ansteuerung heterogener Geräte zu vereinfachen, werden Mashup Tools weithin eingesetzt. Sie bieten eine einfache, benutzerfreundliche Art des Prototypings, aber derzeit mangelt es ihnen noch an einer Integration mit Stream-Analytics-Plattformen.

Das Hauptziel dieser Masterarbeit ist die Integration von Apache Flink, einer Open-Source-Plattform für verteilte Stream Analytics, in aflux, ein am Lehrstuhl für Software und Systems Engineering der Technischen Universität München entwickeltes IoT Mashup Tool. Dazu wurde ein konzeptioneller Ansatz zur generischen, flexiblen und erweiterbaren Modularisierung von Flink-Programmen konzipiert, implementiert und evaluiert. Durch diesen Ansatz können Endbenutzer Flink nicht nur graphisch programmieren, sondern auch Unterstützung bei der Erstellung dieser Programme erhalten.

Keywords: Stream Analytics; IoT Mashup Tools; Big Data Analytics; IoT


12 Resumen Autor: Título en castellano: Tutor: Institución: Lugar de lectura: Federico Alonso Fernández Moreno Modularización de programas Flink para el análisis de datos en tiempo real en herramientas de mashup para IoT Tanmaya Mahapatra Chair of Software And Systems Engineering. Universidad Politécnica de Múnich (TUM) Universidad Politécnica de Múnich (TUM). Garching bei München, Baviera, Alemania Fecha de lectura: 19 de julio de 2018 Contexto, descripción del problema y objetivos La cantidad de datos que se procesa en Internet está en constante crecimiento. El término inglés Big Data engloba los numerosos nuevos procedimientos que son necesarios para gestionar y analizar esta gran cantidad de información que proviene de diferentes fuentes (probablemente con estructuras también muy variadas), haciéndolo suficientemente rápido, pero sin dejar de proporcionar resultados que sean verdaderamente útiles. Los enfoques tradicionales de Big Data, denominados comúnmente procesamiento en lotes o, en inglés, Batch Analytics, tratan los datos como si fueran conjuntos acotados o tandas (batches en inglés). Se basan en recopilar los datos antes de analizarlos, lo cual simplifica notablemente su procesamiento, pero tiene muchas limitaciones, como las grandes capacidades de almacenamiento que se necesitan al no procesar los datos según se reciben, sino tras guardarlos. Aun así, los paradigmas de Big Data más conocidos, como MapReduce, están basados en este enfoque, el cual, aunque se comporta bien cuando lo más crítico es el volumen de los datos, no es capaz de proporcionar buenos resultados con baja latencia, lo que lo convierte en inapropiado para sistemas de tiempo real. Y es que, en muchos casos de uso, como transacciones financieras, interacciones de un usuario con las redes sociales, mensajes intercambiados entre terminales de usuario y estaciones base en una red celular, o eventos en las infraestructuras de una ix

13 Resumen ciudad inteligente (monitorización de la contaminación del aire, control del tráfico, del alumbrado público, etc.), los datos se generan como resultado de una serie de eventos que siguen una estructura de flujo temporal y, por tanto, sin fin. Procesar estos datos en lotes es contraintuitivo, pues no son conjuntos con un principio y un final, y en la mayoría de los casos una latencia elevada hace que el procesamiento tarde demasiado y su resultado sea inválido directamente. Por eso, en estos casos se empezó a aplicar una variante de Batch Analytics basada en procesar micro-lotes, que permiten obtener resultados más rápido. Aun así, gran parte de los esfuerzos de las plataformas de Big Data actuales se centran en buscar enfoques para llevar a cabo el procesamiento de datos en flujo (Stream Analytics en inglés) ex profeso que sean más eficientes. La mayoría de los proyectos de software para procesamiento de datos en flujo (como Apache Spark Streaming o Apache Storm) suelen proporcionar un equilibrio entre latencia, tolerancia ante fallos, consistencia y capacidad de proceso, ofreciendo una funcionalidad limitada. Por el contrario, Apache Flink (en adelante, Flink) ofrece una solución completa diseñada directamente para procesamiento de datos en tiempo real, al contrario que los proyectos anteriores, que han evolucionado desde enfoques basados en lotes. Es por ello por lo que este Trabajo Fin de Máster (TFM) gira entorno a Flink. Por otro lado, la consolidación del Internet de las Cosas (IoT, según sus siglas en inglés) ha desembocado en la que probablemente sea una de las mayores fuentes de flujos de datos en tiempo real. Son miles de millones los dispositivos que, continuamente y en todo el mundo, toman medidas de su entorno, produciendo una ingente cantidad de datos que deben ser procesados, pero no de cualquier modo. Por ejemplo, en una ciudad inteligente, las decisiones sobre los eventos de monitorización del tráfico o del medio ambiente no tienen sentido a menos que se tomen en tiempo real. De esto resulta que, en la mayoría de sus variantes, el IoT impone unos requisitos de latencia en el procesamiento de los datos muy estrictos, pues los actuadores tienen que responder lo antes posible. Claramente, el procesamiento de datos en flujo encaja a la perfección con el IoT. Sin embargo, a pesar de esta gran relación que existe entre el IoT y las técnicas de Big Data, la investigación industrial y académica ha comprobado que la integración de estos dos mundos no es fácil, siendo la heterogeneidad de dispositivos el mayor impedimento. A esta se une el problema de coordinar e identificar los múltiples dispositivos, de tal forma que, en la mayoría de los casos, los desarrolladores de aplicaciones para el IoT deben crear programas muy extensos y complejos para que funcionen con todos los dispositivos. Las herramientas de mashup para IoT, como Node-RED, de IBM, proporcionan a los desarrolladores una capa de abstracción que simplifica la creación de aplicaciones, especialmente para usuarios finales, que carecen de conocimientos de programación. Estas herramientas cuentan habitualmente con una interfaz gráfica (GUI, por sus siglas en inglés), en la que el usuario puede combinar una serie de componentes que se le ofrecen para generar aplicaciones nuevas de forma intuitiva y rápida. Pero no solo son útiles para usuarios sin conocimientos de programación, sino también para usuarios x

14 avanzados que quieran beneficiarse de un nivel de abstracción superior que simplifica su trabajo de desarrollo. Sin embargo, entre las funcionalidades de las herramientas de mashup para IoT actuales no se encuentra la creación de programas para procesamiento de datos en flujo. Este TFM toma este punto de partida, y tiene como objetivo principal hacer posible la creación de programas para procesamiento de datos en flujo con Flink en aflux, una herramienta de mashup para IoT desarrollada en el Departamento de Ingeniería de Software y Sistemas (Chair of Software and Systems Engineering) de la Universidad Politécnica de Múnich (Technical University of Munich, TUM). Para ello, se han diseñado, implementado y evaluado un conjunto de componentes que el usuario puede combinar utilizando la GUI de aflux para crear programas en Flink gráficamente. Son dos las preguntas que se pretenden responder con este TFM. En primer lugar, qué abstracciones son necesarias para modularizar los programas Flink, de tal forma que se puedan crear programas para procesamiento de datos en flujo gráficamente, a través de una herramienta de mashup para IoT? Y, en segundo lugar, cómo se puede ayudar a los usuarios finales durante el proceso de creación gráfica de programas Flink, en concreto para que la forma en la que combinan los componentes sea la correcta? Ambas se enmarcan en el reto que supone la integración de Big Data e IoT en sí misma, tan poco desarrollada hasta el momento, y en las limitaciones de las herramientas de mashup, que se ven limitadas precisamente por su simplicidad de uso. Desarrollo del proyecto En la primera fase del proyecto se han diseñado dos modelos conceptuales. El primero permite traducir automáticamente elementos especificados en una GUI a código ejecutable; el segundo permite evaluar continuamente la composición que hace un usuario al crear un mashup, para llevar a cabo comprobaciones semánticas y proporcionarle realimentación al respecto. El traductor incluye un sistema de actores, que ejecutan la lógica que se requiere para la generación de los programas Flink. Esta generación de código fuente debe ser lo más genérica posible, para garantizar que la flexibilidad y extensibilidad del enfoque implementado sean máximas. Al mismo tiempo, debe ser suficientemente específica como para que tenga sentido realizarla a través de mashups. Por su parte, el segundo modelo incluye la definición de una serie de reglas semánticas entre los componentes de mashup desarrollados, que restringen las posiciones en las que dichos componentes pueden encontrarse dentro del mashup. En la siguiente fase del proyecto se han implementado los modelos anteriores en el contexto de aflux, lo que ha dado lugar a un nuevo plugin para la herramienta que contiene todos los componentes gráficos necesarios para que el usuario pueda crear programas Flink gráficamente. El usuario obtiene directamente un ejecutable que puede desplegar en cualquier instancia de Flink, y todo ello sin necesidad de escribir código fuente. Además, el usuario recibe realimentación por distintas vías gráficas sobre el cumplimiento o incumplimiento de las reglas semánticas relativas a los componentes que ha incluido en su mashup. De esta forma, se asegura que el mashup creado se xi

15 Resumen traducirá en un programa Flink perfectamente ejecutable y sin errores. Finalmente, se ha llevado a cabo una fase de evaluación en la que se ha puesto a prueba el enfoque desarrollado. Para ello, se ha tomado el caso práctico de las ciudades inteligentes, que contienen miles de sensores IoT que crean datos constantemente. Más específicamente, se estudia el caso concreto de la ciudad de Santander, que participa en el proyecto europeo "SmartSantander", fundado por la Comisión Europea como parte del Séptimo Programa Marco (7PM). El objetivo de SmartSantander es proporcionar un entorno de pruebas para ciudades inteligentes completo, con un extenso despliegue de sensores y demás infraestructura inteligente que puede ser utilizado para el desarrollo de nuevas aplicaciones. En esta fase, se han estudiado varios escenarios en los que un usuario utiliza aflux para crear programas Flink que faciliten el acceso a los datos en tiempo real de SmartSantander. El resultado principal ha sido que es posible crear programas de procesamiento de datos en flujo gráficamente de forma muy fácil e intuitiva con el enfoque desarrollado en este TFM. El principal inconveniente identificado es que no todas las funcionalidades de Flink pueden programarse desde aflux, aunque sí son fácilmente integrables dada la extensibilidad del modelo desarrollado. Otras limitaciones vienen dadas por las herramientas de mashup en sí mismas, que proporcionan simplicidad de uso a cambio de limitar la flexibilidad de las opciones que el usuario final puede configurar. Conclusión Las aplicaciones para IoT necesitan cada vez más estar integradas con las plataformas para Big Data, para llevar a cabo análisis que den sentido a las ingentes cantidades de información que se generan en millones de sensores cada minuto. De entre todos los enfoques de Big Data, el procesamiento en flujo es el más indicado para eventos en tiempo real, pero sin embargo no está soportado por las herramientas de mashup para IoT, a pesar de ser un caso de uso que encaja a la perfección con este enfoque. A nivel de implementación, las principales contribuciones de este TFM son: un plugin para aflux que permite la creación de programas Flink gráficamente, una serie de mejoras en el núcleo de aflux que permiten el soporte continuo del usuario final validando los flujos gráficos que crea y un conector para Flink que permite el acceso directo a los datos de SmartSantander al crear programas de procesamiento de datos en flujo. Puede concluirse que las preguntas planteadas al principio del proyecto se han respondido con el trabajo desarrollado: el diseño, implementación y evaluación de un modelo genérico, expandible, escalable y flexible que permite que Flink, una de las soluciones de procesamiento de datos en flujo más punteras, pueda programarse gráficamente a través de aflux, una herramienta de mashup para IoT. Se han creado un conjunto de componentes gráficos que un usuario puede combinar de múltiples formas, siempre supervisado por un sistema que continuamente le proporciona realimentación sobre si lo está haciendo correctamente o no, para garantizar el éxito de los programas generados. xii

Contents

Acknowledgments
Abstract
Zusammenfassung
Resumen
List of Tables
List of Figures
Listings
List of Acronyms

1. Introduction
    Motivation
    Research Questions and Goals
    Methodology
    Outline
2. Background
    Stream Processing
        Requirements of Stream Processing Platforms
        Integration with the IoT
    Apache Flink
        The Concept of Time
        Windows
        Programming Flink
        Flink Against its Competitors
    Mashups
        Integration with the IoT
    aflux
        The Web Application
        The aflux Engine: Leveraging the Actor Model
        The aflux Plug-in Framework
    SmartSantander
3. Related Work
    Nussknacker
    IBM SPSS Modeler
    Microsoft Azure Stream Analytics
4. Conceptual Approach
    Translation and Code Generation
    Validation of Graphical Flows
5. Implementation
    SmartSantander Connector for Flink
        The SmartSantander API
        The Data Model
        The Business Logic
        The Flink Data Source
    Mashup Components for aflux
        Java Code Generation
        Message Passing Among Actors
        Structure of the Actors
        Setting Up the Flink Environment
        SmartSantander Data Source
        Transformation Mashup Components
        Outputting a Data Stream
        Complex Event Processing
        Executing and Generating Job
    Automatic Mapper for Flink API
        The JavaParser Library
        Parsing the Flink API
        Analyzing the Flink API
        The Flink API Mapper
    End-User Continuous Support
        The ToolSemanticsCondition
        Back-End Job Validation
        Front-End Feedback
        Using Conditions in Mashup Components
6. Evaluation
    The Evaluation Scenario
    Overall Considerations
    Use Case 1: Real Time Data Analysis
        Experiment 1
        Experiment 2
        Experiment 3
        Experiment 4
    Use Case 2: Pattern Detection
        Experiment 1
        Experiment 2
    End-User Continuous Support Evaluation
    Critical Discussion
7. Conclusions
    Main Contributions
    Future Work
A. Appendix
    A.1. SmartSantander Connector
    A.2. Mashup Components
    A.3. Flink API Mapper
Bibliography


List of Tables

5.1. Main Endpoints of SmartSantander REST API
Properties of the "SmartSntndr Data" Mashup Component
Properties of the "GPS Filter" Mashup Component
Properties of the "Select" Mashup Component
Supported Windows Application Programming Interfaces (APIs)
Properties of the "Window" Mashup Component
Properties of the "Window Operation" Mashup Component
Properties of the "Output Result" Mashup Component
Defining Contiguity in FlinkCEP
Properties of the "CEP Begin" Mashup Component
Properties of the "CEP New Patt." Mashup Component
Properties of the "CEP Add Condition" Mashup Component
Properties of the "CEP End" Mashup Component
Semantics Conditions in the Mashup Components of the Flink Plug-in
A.1. Structure of the Traffic Dataset
A.2. Structure of the Environment Dataset
A.3. Structure of the Air Quality Dataset
A.4. Mashup Components in the Flink Plug-in for aflux


List of Figures

1.1. Latest Gartner's Hype Cycle, as of July 2017
Architecture of a Stream Processing Platform
Different Concepts of Time in Flink
Windows in Flink
Overall Structure of a Flink Program
Analytics, IoT and Mashups
High-Level Architecture of aflux
Graphical User Interface of aflux
Traffic Sensors in SmartSantander
The Nussknacker Dashboard
Conceptual Approach for Translation and Code Generation
Live Data Provided by the SmartSantander API
The Location Picker Property
Programming Flink from aflux
ToolSemanticsCondition in the aflux Tool Core
Validation Errors Rendered in aflux's Front-End
End-User Continuous Support
The Evaluation Scenario
Flows in aflux for Use Case 1
Tumbling vs. Sliding Windows in aflux
Use Case 1, Experiment 1 (Tumbling Windows)
Use Case 1, Experiment 1 (Sliding Windows)
Changing Filtering Location from aflux
Use Case 1, Experiment 2 (Tumbling Windows)
Use Case 1, Experiment 2 (Sliding Windows)
Use Case 1, Experiment 3 (Tumbling Windows)
Use Case 1, Experiment 3 (Sliding Windows)
Different Window Operations in aflux
Use Case 1, Experiment 4 (Tumbling Windows)
Use Case 1, Experiment 4 (Sliding Windows)
Flow in aflux for Use Case 2
Sample Configuration of Components in aflux for Pattern Detection
Use Case 2, Experiment 1
Use Case 2, Experiment 2
Step-by-Step Flow Composition in aflux
Details About the Errors in the "Window" Component
A.1. Model Used for the SmartSantander Connector
A.2. The SmartSantander Connector
A.3. The Flink API Mapper

Listings

5.1. Deserialization of JSON Resources with Annotations
Constructor of TrafficObservation.java
TrafficObservationDeserializer
Registering the Traffic Deserializer in Gson
Instantiation of SmartSantanderObservationStream
SmartSantanderSource
Sample Code Generated by the "SmrtSntnder Data" Component
Sample Code Generated by the "GPS Filter" Component
Sample Code Generated by the "Select" Component
Sample Code Generated by the "Window" Component
Sample Code Generated by the "Window Operation" Component
Sample Code Generated by the "Output Result" Component
Sample Code Generated by the "CEP Begin" Component
Sample Code Generated by the "CEP New Patt." Component
Sample Code Generated by the "CEP Add Condition" Component
Sample Code Generated by the "CEP End" Component
Required Code to Select Two Different Gas Levels
Tumbling vs. Sliding Windows in Java
Changing Filtering Location in Java
Different Window Operations in Java
Required Code to Create a Pattern that Detects Traffic Jams
A.1. Sample SmartSantander API Response
A.2. Generating an Anonymous Class with JavaPoet
A.3. Output of Listing A.2
A.4. ExecuteAndGenerateJobActor.java
A.5. Basic Example with FlinkCEP


List of Acronyms

API Application Programming Interface.
AST Abstract Syntax Tree.
CEP Complex Event Processing.
CLI Command-Line Interface.
CSV Comma Separated Values.
DAG Directed Acyclic Graph.
EUD End-User Development.
GUI Graphical User Interface.
HTTP HyperText Transfer Protocol.
IDE Integrated Development Environment.
IoT Internet of Things.
JSON JavaScript Object Notation.
MDSD Model-Driven Software Development.
POJO Plain Old Java Object.
REST REpresentational State Transfer.
SQL Structured Query Language.
UML Unified Modeling Language.


1. Introduction

The amount of data that needs to be processed on the Internet is constantly growing. Big Data is an umbrella term for the new approaches that are needed to properly manage and analyze these ever-growing amounts of information, which differ from traditional data in the so-called 4 V's: volume, variety, velocity and veracity [1]. Each of these properties entails a different challenge that Big Data platforms must address: they must be capable of handling great amounts of data that come from different sources (hence with different structures), and do it quickly enough while, at the same time, providing results that really matter.

In many use cases, like stock transactions, user interactions with social networks, exchange of messages between end-user terminals and mobile base stations, or events in smart-city infrastructures (e.g. pollution monitoring, traffic control, lamp posts, etc.), data occur as a series of events that follow a flow-like structure. However, rather than treating them as a stream, traditional Big Data approaches have processed data as if they were finite, bounded datasets called batches. This is known as Batch Analytics. Batch Analytics systems rely on collecting data before they are analyzed, which simplifies the whole process notably. However, they require large storage facilities, as they are not capable of handling data as they arrive. In any case, the currently best-known Big Data paradigms, like MapReduce [2], are based on batch processing.

Batch processing behaves reasonably well when assessing large volumes of data is the priority, especially regarding bounded datasets. However, it fails to provide good results with low latency, and is therefore inappropriate for real-time systems. It can nonetheless be applied to do some sort of stream processing, using what are known as micro-batches, but clearly there must be a more efficient way to do true Stream Analytics.

The main strengths of Stream Analytics also involve great challenges, like ensuring low latency without giving up high throughput, fault tolerance and consistency. For this reason, most stream processing software projects take a trade-off among these challenges and focus on ensuring just a subset of them. This is the case of the open-source projects Apache Spark Streaming and Apache Storm: both evolved from a Batch Analytics system, so they mainly work with micro-batches [3, 4]. On the contrary, Apache Flink (in this report written simply "Flink") takes no trade-off; instead it provides a complete solution focused on stream processing. For this reason, Flink is the data processor that this thesis addresses.

The consolidation of the Internet of Things (IoT) has brought probably one of the most remarkable sources of stream-like Big Data. Indeed, with billions of sensors taking measurements of their environment all over the world, an incredibly extensive amount of data is constantly being produced [5]. Take, for instance, any scenario in a smart city, such as traffic monitoring, environment tracking, etc. Decisions in these scenarios need to be taken in real time. Therefore, in most cases the IoT clearly enforces strict latency requirements, because actuators need to take action as soon as possible as a result of processing the measured data. Stream Analytics fits the IoT like a glove.

In spite of this strong bond between IoT and Big Data analytics [6], enabling Stream Analytics in IoT platforms is no easy task. There are many challenges to it, the diversity of IoT devices probably being the most important one. If device identity and coordination among devices are also considered, application developers usually have no alternative but to write complex boilerplate code to address all devices at once [7]. IoT mashup tools, like Node-RED, provide them with an abstraction layer that simplifies the creation of applications, especially for non-programmer end users. However, they do not support the creation of Stream Analytics flows so far. This thesis starts from this point and is aimed at enabling Stream Analytics with Flink in aflux, an IoT mashup tool developed at the Chair of Software and Systems Engineering of the Technical University of Munich.

1.1. Motivation

There is a great number of challenges that encourage undertaking this project. For starters, the integration between the Big Data and IoT worlds is itself an issue, as very little research has been done on this topic. In fact, as can be seen in Fig. 1.1 (taken from [8]), the IoT platform itself is currently considered a key platform-enabling technology, to be adopted within the next 2 to 5 years. Similarly, connected homes, one of the main IoT use case scenarios, are also raising great expectations for the upcoming years. It is clear that it is the right moment to do research in IoT.

The limitations of mashups also apply here. Mashup tools have many advantages, especially for end users with no programming skills, but they also entail some limitations which derive from their simplicity of use. The domain of application of the mashup is critical [9], as it determines the way its mashup elements are designed, so complete generality cannot be reached and should in fact be avoided, because it would go against the main advantage of mashups: their ease of use. Thus, the level of abstraction that is accomplished should be generic enough to allow multiple variations of the created jobs, but also specific enough to the domain of application so that end users find it easy (and even fun) to develop new applications.

Figure 1.1.: Latest Gartner's Hype Cycle, as of July 2017 [8]. The IoT platform has just left the innovation trigger area and is expected to be widely adopted within 2 to 5 years.

In this thesis, the domain of application is Smart Cities, which contain thousands of IoT sensors that constantly create data. More specifically, the city of Santander, in the north of Spain, participates in a European project named "SmartSantander", which was initially funded by the European Commission as part of the 7th Framework Programme (FP7) [10]. The aim of SmartSantander is to provide a full smart-city testbed, with a complete deployment that leverages the Future Internet to become a reality [11]. SmartSantander serves as a practical use case scenario for this thesis. It has been taken into consideration both when designing the modularization of the Flink API and when evaluating the outcome of the implementation. Part of the code is also planned to be contributed to the community, to allow others to experiment with the data from SmartSantander as has been done in this thesis.

The integration of Flink and an IoT mashup tool like aflux constitutes a great challenge as well. The conceptual approach of aflux must be studied in detail to fully understand how it behaves, and hence be able to design a modularization of Flink programs that suits the architecture and behavioral model of aflux. Another issue when it comes to modularizing the API of Flink is that the code generation should be made as generic as possible. As many Flink APIs as possible should be supported, providing an abstraction to end users of aflux, while at the same time ensuring some specificity derived from the domain of application.

As can be seen, there are many important challenges that motivate this thesis. In the following section, the main goals to face them are stated.

1.2. Research Questions and Goals

The main task of this thesis is to enable the creation of Flink programs from aflux. By designing and implementing a set of new mashup components inside aflux (packaged in the form of a plug-in), IoT application developers will be able to create Flink jobs graphically. But they do not only need the tools to create Flink programs, they also need support on how to use them, especially considering that the target users of mashup tools are people with no programming skills, or in general any end user who is not willing to take care of coding. Thus, two research questions are devised in the scope of this thesis:

Research Questions
#1 Which abstractions are necessary to modularize Flink programs so that they can be created from flow-based, graphical mashup tools?
#2 How can end users get support during the process of creating Flink programs graphically?

These questions frame the context of this thesis, which is aimed at giving a response to them. However, they should be narrowed down to obtain a properly defined problem that can be addressed completely in this thesis. For instance, a complete modularization of Flink programs has not been sought, but just a subset of the Stream Analytics functionalities. The same applies to the support stated in question #2: it is limited to enforcing a certain order of the mashup components in the Graphical User Interface (GUI) of aflux. Consequently, the revisited questions are as follows:

Research Questions Revisited
#1 Which abstractions are necessary to modularize Flink streaming programs so that they can be created from flow-based, graphical mashup tools?
#2 How can end users get support during the process of creating Flink programs graphically so that they place visual components in the right order?

After defining the set of different modules that compose a Flink program, a mashup component has been created out of each of them. In the end, the user will be able to combine these building blocks in whichever way they want to create a Flink job. And they will get continuous support on how to fulfil this task successfully, so as to avoid errors that could result in the job not being created properly. As a result, the goals of this thesis are as follows:

- Perform a complete analysis of the Flink software project, to study its structure and come up with a way to modularize its programs. This approach should be flexible, generic and extensible.
- Integrate this modularization approach into aflux, by developing a set of mashup components that can be combined using the tool.
- Provide the users with a support mechanism that helps them in the creation of Flink programs with the developed artifacts.
- Contribute the implemented artifacts and approaches to the community of aflux, Flink and SmartSantander.

1.3. Methodology

To accomplish the goals stated in Section 1.2, the following methodology has been followed:

1. Literature review. In this stage, the state of the art with regard to stream analytics and IoT mashup tools was analyzed.
2. Design. This step dealt with seeking an approach to frame Flink programs so that they can be created from aflux. It included the following tasks:
   a) Analyze how mashup tools in general, and aflux in particular, work, both from the front-end and back-end perspectives.
   b) Analyze how Flink and its APIs work, to fully understand what it provides and how it is achieved.
   c) Design a set of mashup components that allow the creation of Flink jobs. These components are mapped to Akka actors in the background, which execute the required logic to generate Flink programs. A proper modularization has to be designed, so that the approach is extensible and supports error traceability.
   d) Design an approach to give users continuous support on the way they are using the mashup components, to prevent errors.
3. Implementation. In this step, the results from the design phase were used as input to a development stage of all the necessary components. Java has been used as the main programming language. The main outcome of this step is a set of mashup elements for aflux that can be combined to create Flink jobs.

4. Evaluation. This step was about assessing how easy it is to create Flink jobs with the designed approach. The extensibility and limitations of the implemented code-generation technique were assessed and analyzed, and some conclusions were drawn.

This whole project has been undertaken at the Chair of Software and Systems Engineering of the Department of Informatics of the Technical University of Munich.

1.4. Outline

The structure of this document is as follows. Chapter 2 contains the background that is of relevance for this work. Concepts like Stream Analytics and mashup tools are presented, as well as Apache Flink and aflux, the most relevant software projects that take part in this work. Finally, an overview of the SmartSantander project, the main use-case scenario on which this thesis is based, is given. After that, the relevant works related to the matter of this thesis are discussed in Chapter 3. In Chapter 4, the conceptual approach to address the research questions is described, and the most important design choices are highlighted. Chapter 5 deals with the implementation tasks that have been accomplished as part of this thesis. Fine details are given to show how each component has been designed and works, and the main problems that have been faced are outlined. The implemented artifacts are evaluated in Chapter 6, by means of a set of use cases and examples that show how easy it is to experiment with the data from SmartSantander with the implemented approach. In light of the evaluation results, a critical discussion of the work accomplished is provided at the end of that chapter. Finally, the overall conclusions are drawn in Chapter 7, in which the main contributions of the thesis are also highlighted and some future lines of work are identified.

34 2. Background In this chapter, the concepts that are required for a better understanding of this thesis are presented. On the one hand, a description of Stream Analytics is provided, then Flink is described as a software platform to perform those analytics. On the other hand, mashups are described from a theoretical perspective, and then aflux is presented as a practical example of IoT mashup tools. Finally, the SmartSantander project, which is the main use case that has been considered for this thesis, is shortly introduced Stream Processing The term "Big Data" was coined to refer to the great amounts of data that need to be processed, not only in terms of volume but also in terms of other subtler parameters like velocity. The MapReduce framework, initially presented by Google [2, 12], was then implemented as an open-source project by the Apache Software Foundation, called Hadoop 1. Both of them had been designed for processing data batches with high throughput, and actually could do it in a very efficient way, though these tasks could take several hours or even days [13]. These time restrictions are simply unacceptable for many real-world applications that require the shortest response times available (a couple of minutes or seconds), like most IoT applications. Even more important is the fact that the aforementioned batch processing enforces all the data being available before it can be fed to the processing system. But, again, sometimes it is necessary to make computations before all the data have arrived, in order to have real-time results, regardless of whether or not all data input is already in the system. The requirements and essence of IoT applications and scenarios encourage the use of Stream Analytics, because they meet the characteristics of streaming data: they produce continuous, unbounded and sometimes disordered streams of events [14]. In these contexts, data monitoring is also desirable, in order to search for patterns, anomalies, etc. Thus, Stream Processing is about a continuous processing of data, not only to provide instantaneous answers, but to endlessly monitor data in the wide sense. It is important not to mix Stream Processing with Real-Time Processing. The second one is an evolution of MapReduce that focuses on providing results quickly, but no stream of events is considered. For instance, the in-memory processing approach of Apache Spark makes for a real-time solution, though no streaming is supported in the first place [13]. 1 [Online] 7

Of course, it is possible to somehow adapt MapReduce to data streams, by breaking the data into a set of finite batches. This is known as micro-batching, and it is what Apache Spark Streaming does [15]. However, Spark is not a stream-first platform, unlike other approaches such as Apache Storm and, especially, Flink.

Requirements of Stream Processing Platforms

An ideal stream-first platform should meet the following requirements [16]:

- Low latency. Streaming platforms usually make use of in-memory processing, in order to avoid the time required to read/write data in a storage facility. They should be capable of handling data as they arrive. For this reason, a message transport is vital in a streaming architecture, as can be seen in Fig. 2.1 (a short illustrative sketch of such a transport follows this list). This transport layer should feature not only extraordinary performance, but also decoupling between data publishers and data consumers [17].
- Data querying. Streaming platforms should make it possible to find events in the whole data stream. Typically, a Structured Query Language (SQL)-like language is advised [16]. But, since data streams never come to an end, there has to be a mechanism to define the limits of a query (otherwise it would be impossible to query streaming data). This is where the window concept comes in: windows define the data on which an operation may be applied, which is why they are key elements in stream processing.
- Out-of-order data. Since the streaming platform does not wait for all the data to be available, it must feature a mechanism to handle data coming late or not coming at all. A concept of time needs to be introduced, as opposed to batch analytics systems, which process data in chunks regardless of which arrived first.
- Replicable computations and stored state. In spite of the disorder that may exist in the data, computations need to be repeatable. Besides, they may sometimes include information from the past, e.g. to make comparisons. This means that there has to be a system state, and it needs to be consistent [18].
- High availability and scalability. Stream processors will most probably handle ever-growing amounts of data, and in most cases other systems rely on them, e.g. in IoT scenarios. For this reason, it has to be guaranteed that the stream processing platform is reliable, fault-tolerant and can handle just about any number of data events.
- High throughput. Scalability and parallelism are key to enabling high performance in terms of the amount of data that can be processed. Stream processing systems are often asked for real-time performance, so they are useless if they cannot handle new data as they arrive: they should behave properly under stress [17].
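The role of the message transport can be made concrete with a few lines of code. The snippet below is only an illustrative sketch, assuming Apache Kafka as the transport and a Flink 1.x Kafka connector; the topic name, broker address, consumer group and connector class are placeholder assumptions and are not part of this thesis.

```java
// Illustrative only: a Flink job reading its input from a message transport
// (here Kafka), which decouples the event producers from the stream processor.
// Topic, broker and group id below are placeholders.
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;

public class TransportExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.setProperty("group.id", "iot-consumer");            // placeholder consumer group

        // Events arrive as they are published; the transport buffers them for the processor.
        DataStream<String> events = env.addSource(
                new FlinkKafkaConsumer011<>("sensor-events", new SimpleStringSchema(), props));

        events.print(); // hand the raw events to a trivial sink for inspection
        env.execute("Message transport to stream processor");
    }
}
```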

36 2.1. Stream Processing Events Real-Time Streams Transport Stream Processor Store, Transport, Serve Stored Past Data Figure 2.1.: Architecture of a Stream Processing Platform The first approaches to stream processing, like Apache Storm and Apache Spark Streaming, used to focus on some of these requirements, such as low latency in the former example and high throughput in the latter one [3]. They were built on top of MapReduce, as stated above. An improvement was made with the so-called lambda architecture, a very well-known approach [17, 19, 20] that combines batch and stream-like approaches in order to reach short response times (in the order of seconds). This approach has some advantages, but one critical downside: the business logic needs to be duplicated into the stream-like and the batch processors. Experts agreed that there had to be a better way to face this challenge [21], and stream-first solutions like Flink appeared, promising to meet all the requirements [17]. They all have a similar architecture, which is depicted in Fig Apache Kafka 2 and Flink can be used in the transport and stream processing components respectively Integration with the IoT Stream Analytics makes sense for a wide variety of scenarios and practical use cases, ranging from the financial sector, telecommunications, and even retail and marketing [17]. In general, any area or field of study in which data are generated as a flow of events is suitable for Stream Analytics. The IoT is composed of millions of sensors that measure a quantity in the real physical world, and actuators that need this data to be processed with almost no latency, to ensure a proper response in time. These requirements encourage the use of Stream Analytics to the greatest extent [22]. For this reason, a lot of work and research has been conducted to bring the world of Big Data (especially Stream Analytics) and the IoT together [23]. In some cases, new semantics to specifically target IoT are designed [22]. Some proposals are based on the aforementioned lambda architecture [24] to combine both 2 [Online] 9

37 2. Background batch and stream processing, while others dive into new network paradigms like Fog Computing [25] to achieve near-to-zero network latency. In the world of Big Data management in a more general sense, many frameworks have been proposed. For instance, a complete data-centric design for a smart city is presented in [26], with a framework that spreads along all the levels of the communication stack: from the physical level to applications. They emphasize the necessity to make sense out of all the data generated in the IoT, just like in the cognitive IoT paradigm presented in [27]. Clearly, there is great interest in bringing these two worlds together. Still, very few research has been done in enabling fast prototyping of these analytics applications. To this purpose, mashups offer a powerful yet simple approach that even users with no programming skills can use Apache Flink In its official website [18], Apache Flink is defined as "a framework and distributed processing engine for stateful computations over unbounded and bounded data streams". In more detail: It is a framework, because it allows the creation of programs through a set of public APIs, which currently support Java, Scala and the REpresentational State Transfer (REST) architecture. A web interface is also available to easily manage the jobs in execution. It is a distributed engine, and it is built upon a distributed runtime [28] that can be executed in a cluster, to benefit from high availability and high-performance computing resources. A wide range of deployment options are supported [29]: YARN, Mesos, Docker, Kubernettes, Amazon Web Services, Google Compute Engine, and even Hadoop. It is based on stateful computations. Indeed, Flink offers exactly-once state consistency [17], which means that it is able to ensure correctness even in case of failure. Flink is also scalable, because the state can also be distributed among several systems. It supports both bounded and unbounded data streams. Flink cannot only process data as it arrives, but also process a historical stream of events. This is not the same as batch processing, because data are still processed as streams it is just that the time reference is taken back to the past. In the latest releases, the support of bounded data sets has been extended to perform proper batch analytics, as this is indeed a special case of streaming [17]. Flink achieves all this by means of a distributed dataflow runtime that allows a real stream pipelined processing of data [28]. 10
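As a rough illustration of how the stateful, exactly-once guarantees mentioned above are switched on in practice, the sketch below enables periodic checkpointing for a job. The 10-second interval is an arbitrary example value, not a recommendation from this thesis.

```java
// Minimal sketch: enabling Flink's checkpointing so that operator state can be
// restored consistently (exactly-once) after a failure. The interval is an example value.
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StatefulSetupExample {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Take a consistent snapshot of all operator state every 10 seconds.
        env.enableCheckpointing(10_000L, CheckpointingMode.EXACTLY_ONCE);
    }
}
```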

38 2.2. Apache Flink The Concept of Time As it was stated in Section 2.1, a streaming platform should be able to handle time, because this is the reference that is used for understanding how the data stream flows, that is to say, which events come before or after a given one. Time is used for creating windows, and to perform operations on streaming data on the wide sense. Time is the key to solve data disorder [17]. Flink supports several concepts of time. Fig. 2.2 depicts them using the streaming architecture presented in the previous section. Events Real-Time Streams Transport Stream Processor Store, Transport, Serve Stored Past Data Processing Time Event Time Ingestion Time Figure 2.2.: Different Concepts of Time in Flink Event Time Event time refers to the time in which an event took place. It is usually appended to the event metadata by the event producer as some timestamp attribute. By featuring event-time semantics, Flink allows processing the data considering where the events happened in the real world. No matter if they arrive out of order, or if the message transport makes a mess of them, or if they come straight from a recorded source. To achieve this, Flink support a mechanism of additional timestamps called watermarks, which allow making progress in event time [30]. Watermarks flow along with the data, and they indicate that by that point of the stream, all the events prior to a certain timestamp must have arrived. Now, the processor may advance its internal event-time clock. Watermarks permit the processing of out-of-order streams [30, 17]. And even if events arrive after a later watermark has already been processed, Flink can recalculate computations or send them to a side output. 11
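A hedged sketch of these event-time semantics is given below, using the BoundedOutOfOrdernessTimestampExtractor helper shipped with Flink 1.x. The stream of (sensorId, timestamp) tuples and the 5-second lateness bound are assumptions made only for the example.

```java
// Illustrative event-time setup: timestamps are taken from the events themselves
// and watermarks trail the largest timestamp seen so far by a fixed bound.
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;

public class EventTimeExample {

    // Events are assumed to be (sensorId, epoch-millis timestamp) pairs.
    public static DataStream<Tuple2<String, Long>> withEventTime(
            StreamExecutionEnvironment env, DataStream<Tuple2<String, Long>> readings) {

        // Use the time at which events happened, not the time at which they arrive.
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        // Watermarks lag 5 seconds behind the highest timestamp observed, so events
        // arriving up to 5 seconds late are still assigned to the right windows.
        return readings.assignTimestampsAndWatermarks(
                new BoundedOutOfOrdernessTimestampExtractor<Tuple2<String, Long>>(Time.seconds(5)) {
                    @Override
                    public long extractTimestamp(Tuple2<String, Long> reading) {
                        return reading.f1; // timestamp written by the event producer
                    }
                });
    }
}
```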

Processing Time

This time is related to the machine on which the streams are processed: its internal system clock is used to process data. Of course, it provides the shortest latency [30], but it is not fault-tolerant and depends on the speed at which the messages arrive from the network and the message transport. Flink features processing-time semantics for those applications which do not need exact results.

Ingestion Time

This time concept lies between event and processing time. In this case, there are watermarks (as in event time), but they are assigned automatically by Flink as the data enter it. They are therefore simpler in this sense, but they do not allow processing out-of-order streams. Internally, ingestion time is treated as event time [30].

Windows

Windows are a basic element in stream processors. Flink supports different types of windows [31], and all of them rely on the notion of time defined in the previous section.

- Tumbling windows (Fig. 2.3a). With a specified window size, tumbling windows assign each event to one and only one window, without any overlap. Their size is fixed.
- Sliding windows (Fig. 2.3b). Their size is also fixed, but an overlap called slide is allowed.
- Session windows (Fig. 2.3c). They are really interesting for some applications, because sometimes it may be useful to process events in sessions. This is something that cannot be done in micro-batching, as shown in [17].
- Global window (Fig. 2.3d). By assigning all the elements to one single window, this approach allows the definition of triggers, which tell Flink when exactly the computations should be performed. It is a completely customizable window.

Programming Flink

As stated above, Flink is a framework, and it can be used to develop stream processing jobs that can be applied to just about any data source. To this purpose, it offers a wide set of APIs, both in Java and in Scala, that expose the capabilities of Flink on three different abstraction levels:

Stateful stream processing. It features full flexibility, by enabling low-level processing and control. In most applications, such a degree of freedom is not required and can even be cumbersome.

40 2.2. Apache Flink Window 1 Window 2 Window 3 Window 2 Window 4 Window 1 Window 3 A B C D A B C D Size Slide Size (a) Tumbling Windows (b) Sliding Windows Window 1 Window 2 Window 3 A B C D A B C D Session gap (c) Session Windows (d) Global Window Figure 2.3.: Windows in Flink Core level. This is the widest abstraction level. By means of both a DataStream API and a Dataset API, Flink enables not only stream processing but also batch analytics on "bounded data streams", i.e. data sets with fixed length. In this thesis, only the DataStream functionalities have been considered, because they are what make Flink disrupting but it should be kept in mind that Flink can successfully handle batch jobs. Declarative domain-specific language. Flink offers a Table API as well, which provides a high-level abstraction to data processing. With this tool, a data set or data stream can be converted to a table that follows a relational model. The Table API is more concise, because instead of the exact code of the operation, logical 13

41 2. Background operations are defined [32], but at the same time less expressive than the core APIs. In the latest Flink releases, an even-higher-level SQL abstraction has been created, as an evolution of this declarative domain-specific language. Libraries. On top of the user-facing APIs stated above, some libraries with special functionality are built. The added value ranges from Machine Learning algorithms (currently just available in Scala) to Complex Event Processing (CEP) and graph processing [17]. In any case, the structure of a Flink program (especially when using the core-level APIs), is the one shown in Fig Data from a source enters Flink, where a set of transformations is applied to them (window operations, data filtering, data mapping, etc.). The results are in turn yielded to a data sink. Data Source Transformations Data Sink Figure 2.4.: Overall Structure of a Flink Program [33] Flink Against its Competitors As a final note, it is interesting to see how Flink behaves when it is compared to other popular platforms like Apache Spark Streaming and Apache Storm. It has been demonstrated [34] that Flink and Storm, as true Stream Analytics processors, show much lower latency than Spark Streaming, yet Storm seems to include a lot of overhead that could be a problem with high throughputs On the other hand, Spark Streaming, being based in micro-batches, can handle higher throughputs with the right configuration [34]. But it does not perform automatic optimization of job execution, unlike Flink [35]. It seems like, because of its pipelined execution, automatic optimizations and ease of configuration, Flink should be chosen in most scenarios [35]. In any case, not a single framework is suitable for all data and jobs: as an example, Spark is nearly 2 times faster than Flink for large graph processing [35]. 14
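The source-transformations-sink structure of Fig. 2.4 can be made concrete with a minimal, self-contained sketch. The example below is purely illustrative: the socket source, the word-count logic and the one-minute tumbling window (with a commented sliding variant) are arbitrary choices, not code from this thesis.

```java
// Minimal Flink program following the structure of Fig. 2.4:
// data source -> transformations (map, keyBy, window, aggregate) -> data sink.
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;

public class StructureExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // 1. Data source: an unbounded stream of text lines from a local socket.
        DataStream<String> source = env.socketTextStream("localhost", 9999);

        // 2. Transformations: count occurrences of each line per one-minute window.
        DataStream<Tuple2<String, Integer>> counts = source
                .map(line -> Tuple2.of(line.trim(), 1))
                .returns(Types.TUPLE(Types.STRING, Types.INT))
                .keyBy(0)                              // group by the line contents
                .timeWindow(Time.minutes(1))           // tumbling window of fixed size
                // .timeWindow(Time.minutes(1), Time.seconds(30)) // sliding variant
                .sum(1);

        // 3. Data sink: print the windowed counts to standard output.
        counts.print();

        env.execute("Source-transformations-sink example");
    }
}
```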

42 2.3. Mashups 2.3. Mashups There is much controversy in research on how to define a mashup [36]. They are usually considered to be applications that have been built by combining different elements that were previously available and reuse data or business logic [37]. These building blocks are commonly referred to as mashup components, and the mashup logic is the way in which they are assembled altogether [36]. To draw mashups a mashup tool is used, which may or may not have a web interface. Mashups open a wide range of possibilities to software engineering and software development. For starters, there is Model-Driven Software Development (MDSD), which is about using a detailed model that a machine can understand to automatically create artifacts that are part of the final software deliverable [38]. MDSD offers higher speed at a lower cost [36], because unnecessary low-level details are hidden from the model. It is important not to confuse MDSD with Model-Based Software Development, which makes use of models in the software design stage, but these models do not need to be fully detailed they are just an aid for developers. MDSD ultimately enables End-User Development (EUD), which refers to non-professionals able to develop software by means of (in this case) a mashup tool. Although mashups are typically simple applications, they provide a set of benefits that undoubtedly motivate the development of this modeling approach. They enable fast prototyping, so that even skilled developers may find mashups useful, because they are abstracted from a lot of boilerplate code that can be easily reused instead. Besides, their target user ranges from end users with no programming skills to experienced software engineers. Mashups provide quick responses to simple problems and can be even be fun to use [36]. Depending on what the mashup components represent, typically mashups and mashup components are classified according to three levels of abstraction: Data mashups, when their components allow composing low-level interfaces and raw data (e.g. integrating data from a REST interface). Logic mashups, when their components provide access to business logic or functionalities (such as any given algorithm). User-interface mashups, when they provide a GUI that allows combining fully operating components that provide an added value to end users (like a customizable news web dashboard). Hybrid mashups, when they combine components of different types, to provide richer capabilities to the mashup developer. In this thesis, hybrid mashups remain as the most thoughtful alternative, as long as its limitations (e.g. conflicts between elements of different levels) are taken into consideration. 15

43 2. Background Integration with the IoT The benefits of mashups make it suitable for many fields of application, ranging from the traditional web to mobile environments [39, 40], both for end users and also large enterprises [41]. Among all of them, the IoT is indeed one of the most promising fields of application for mashups as well. Certainly, mashups suit the IoT use case like a glove. For instance, one of the greatest challenges in IoT is the high degree of devices heterogeneity, which requires an abstraction from the low-level layer to address all devices at once. For this reason, a lot of IoT mashup tools have been developed recently both in research and in production environments [42]. Here are the most remarkable ones: Node-RED is a mashup tool developed by IBM which offers a set of visual components that stand for JavaScript code. It is suitable for real-time applications that usually run on embedded hardware and handle significant amounts of data. IoT-MAP [43] is a solution that focuses on the mobile world. The authors propose the creation of a platform that brings both mobile application developers, smart things manufacturers and end users together. IoTLink [44] is said to be a development toolkit based on MDSD, that makes use of a high-level model to enable the distributed composition of devices and services into a mashup to visually define how they represent the concept of a "thing". The model can then be translated into Java, for further refinement by a human developer. Other approaches treat IoT mashups as a middleware that can be leveraged to enable wider applications. For instance, the authors in [45] present a lightweight model that uses the REST principles as a basis to build an IoT-mashup-development platform. However, these approaches do not support data analytics, especially Stream Analytics. Instead, they aim to provide a way to create applications for the IoT (which of course is of great value), rather than applications for analyzing the data that is produced inside them. There is a lack of tools that combine the development of IoT applications and stream analytics [7, 46]. Recap Current approaches of IoT mashup tools are limited in terms of Stream Analytics, whereas Data Analytics for IoT do not feature any type of EUD. It seems necessary to bring this together to offer added value to IoT application developers and users. 16

2.4. aflux

To overcome the limitations specified above, the Chair of Software and Systems Engineering of the Technical University of Munich has created aflux [47], an IoT mashup tool aiming to support Stream Analytics when graphically developing services and applications for the IoT [48]. As shown in Fig. 2.5, it brings together the IoT and analytics using mashups as the enabling technology.

Figure 2.5.: Analytics, IoT and Mashups (aflux at the intersection of Big Data and Stream Analytics frameworks, the Internet of Things and IoT mashup tools)

Among the most relevant features of aflux are a set of new semantics to support asynchronous execution patterns and multi-threaded applications, which most IoT mashup tools, like Node-RED, still lack. Querying Big Data Analytics systems is supported in aflux, as well as some "unified analytics" that enable addressing different systems at once [49].

Fig. 2.6 shows the overall architecture of aflux. Each element will be explained in detail: first the web application will be covered, and then the rest of the subsystems of the back-end.

The Web Application

The web application is composed of two main entities: the front-end and a REST API back-end.

Figure 2.6.: High-Level Architecture of aflux (front-end with components, containers, actions and state; back-end with REST API, engine and plug-ins, including the main, common and analytics plug-ins)

The Front-End

The front-end of aflux provides a GUI for the creation of mashups. It is based on React and Redux, two frameworks for building user interfaces. The focus of these frameworks is on structuring the application as a set of components, be they visual GUI components or a bundle of several of them. When the user interacts with these components, they generate actions that are dispatched and handled by containers, which contain the business logic. Apart from this, there is an overall stored state of the application, used to render the visual components and modified by the business logic.

Fig. 2.7 shows the GUI of aflux. As can be seen, mashups are created by drag-and-dropping mashup components from the left panel. Mashup components are loaded from plug-ins, which are explained below. The application shows a console-like output in the footer, and the details about a selected item are shown on the right panel. The following elements are considered in the GUI of aflux:

- Mashup components are the building blocks of the tool.
- A Flow is a synonym for mashup. It is composed of several mashup components that the user wires together. The tool also supports subflows, which are user-created mashups that are then available as a mashup component, to encourage reuse.
- An Activity is a set of flows that are located in the same canvas. aflux supports adding several flows to the same activity.
- A Job is the most general data structure in aflux. It can contain several activities, which can be selected from the tab bar in the bottom part of the application.

Figure 2.7.: Graphical User Interface of aflux (application header and menu bar, side panel listing the available mashup components, canvas with the mashups, activity tabs, add-plug-in button and console-like output)

The Back-End

The jobs that the user has created, as well as the main configuration parameters of the GUI and some internal metadata, are stored in a MongoDB database for persistence. Besides, the logic to execute the flows in a job is not located in the front-end: the back-end of aflux offers a REST API for the GUI to trigger these operations. The back-end of the web application is based on the Spring framework for Java.

The aflux Engine: Leveraging the Actor Model

One of the most important features of aflux is its execution model. Addressing it in detail is outside the scope of this document, yet some important remarks are needed in order to understand the implementation approach adopted for this thesis.

When a flow is sent to the back-end, it must be translated to an internal model, as explained in Section 2.3. In aflux, this internal model is a graph called the "Flow Execution Model" [49]. This model is composed of actors, since aflux makes use of the actor system approach. More specifically, it makes use of the Akka actor system for Java.
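To make the actor pattern concrete before it is discussed further, the following minimal Akka sketch (written for this document only; it is not part of the aflux code base, and the class and message names are purely illustrative) shows an actor that reacts to a message and replies asynchronously to its sender:

import akka.actor.AbstractActor;
import akka.actor.ActorRef;
import akka.actor.ActorSystem;
import akka.actor.Props;

// A minimal actor: it reacts to String messages and replies to the sender.
public class EchoActor extends AbstractActor {

    @Override
    public Receive createReceive() {
        return receiveBuilder()
                .match(String.class, msg -> {
                    // Perform the actor's computation...
                    String result = "processed: " + msg;
                    // ...and send a message back to whoever sent the original one.
                    getSender().tell(result, getSelf());
                })
                .build();
    }

    public static void main(String[] args) {
        ActorSystem system = ActorSystem.create("example-system");
        ActorRef echo = system.actorOf(Props.create(EchoActor.class), "echo");
        // Messages are always sent asynchronously; tell() returns immediately.
        echo.tell("hello", ActorRef.noSender());
    }
}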

In an actor system, actors encapsulate both state and behavior [50]. When an actor receives a message, it starts to perform its associated computations, and it may send a message to another actor when finished. In aflux, messages can only be sent asynchronously [47]. Concurrency of actors is also supported. The actor system indeed fits the mashup use case, in which mashup components also communicate through messages (e.g. Node-RED offers message.payload to send information from one node to another).

The back-end and engine are written in Java and use Maven for project management and comprehension.

The aflux Plug-in Framework

aflux is not only an IoT mashup tool offering novel features like asynchronous flows; its functionality is also expandable. The aflux engine offers a Java API that, despite being simple, enables the development of plug-ins. This API abstracts all the internal logic related to mashup components, such as how they are treated by the web application and the flow execution model, and leaves just two main issues to be handled by the plug-in: how the mashup component looks in the GUI, which includes its properties, and the business logic that the component stands for, i.e. the business logic of the actor associated with it.

Plug-ins are Maven projects that are packaged (in Java's .jar format) and directly loaded from aflux at run-time. The dependencies they require are also bundled inside the .jar, to make it standalone. With this approach, several plug-ins are already available, which range from general-purpose utilities to a common analytics language, including input/output operations with Kafka, files, MQTT (a widely used connectivity protocol for the IoT), etc.

2.5. SmartSantander

Smart Cities are considered to be one of the most relevant fields of study of the Future Internet, and because of that they are a popular topic both in research and industry [51]. Cities indeed make a great testbed for IoT technology, because it has a direct impact on citizens. For this purpose, the European Commission, as part of FP7 [10], invested in the creation of a complete testbed environment in the city of Santander, Spain, that can be used to develop meaningful IoT applications. In the context of the SmartSantander project [52], a whole experimental research facility for smart cities has been developed, so that it can be deployed in other cities apart from Santander. In fact, there are already nodes in Guildford, UK; Lübeck, Germany; and Belgrade, Serbia [11].

The city-scale deployment was rolled out in three phases, starting in November 2011 [52], and it addresses many application domains: traffic, parking, environmental monitoring, parks and gardens irrigation, etc. Around 3,000 IEEE 802.15.4 devices [53], 200 GPRS modules and 2,000 smart tags (RFID and QR code labels) compose the SmartSantander facilities. They are located both at static locations (streetlights, façades, bus stops) and embedded in public vehicles like buses and taxis.

Figure 2.8.: Traffic Sensors in SmartSantander

Apart from IoT nodes, repeaters and gateways, the SmartSantander facility is structured in four subsystems: an Authentication, Authorization and Accounting subsystem, a Testbed Management subsystem, an Experimental Support subsystem and an Application Support subsystem.

To encourage the development of new applications, the SmartSantander facility has an open API, managed both as part of the project and by the City Hall of Santander. As an example, Fig. 2.8 contains a map of Santander with the locations of the traffic sensors, which have been retrieved from the API and represented using a cloud computing service. The SmartSantander facility is also accessible from other external APIs, like the global instance of the FIWARE Context Broker [54, 55], a publish-subscribe component for handling IoT data that is part of the European-funded FIWARE Platform.


3. Related Work

In this chapter, the most relevant works that apply to the matter of this thesis are presented. It is important to highlight that, as stated above, very few approaches exist on the topic of integrating EUD and Stream Analytics for the IoT.

Nussknacker

The most relevant work for this thesis is Nussknacker [56]. It is a GUI application to design, deploy and monitor Flink jobs. It is an open-source project (available on GitHub under the Apache License 2.0), and it is production-ready: in fact, a major Polish telecommunications company has been using it since the beginning of 2017 [56].

Figure 3.1.: The Nussknacker Dashboard

As it is still being developed, there is not much documentation about Nussknacker. Yet three elements in its architecture can be identified:

- An engine, whose aim is to transform the graphical model created in the GUI (in JavaScript Object Notation (JSON) format) into a Flink job.
- A standalone user interface application, which allows both the development and deployment of Flink jobs. It is written in Scala and incorporates data persistence and a Flink client.

- A set of integrations that incorporate the model of the scenario under use.

Basically, a developer needs to enter the data model of their use case into Nussknacker. Then users with no programming skills can benefit from the GUI to design a Flink job, send it to a Flink cluster and monitor its execution.

IBM SPSS Modeler

The SPSS Modeler is a data mining application developed by IBM. It is an enterprise-oriented tool that enables graphical programming of data analytics, including machine learning algorithms and visual data science. This is a very specific tool targeted at enterprises, for them to incorporate data mining and machine learning in a comprehensive way, but it was not designed for the IoT in the first place.

Microsoft Azure Stream Analytics

Microsoft also offers Stream Analytics as part of its Azure IoT Edge platform. In this case, it is an on-demand service that can be configured with a GUI, so it offers all the features of Cloud Computing. Real-time dashboards can also be created to visualize the results, and CEP pipelines can be created as well.

4. Conceptual Approach

In this chapter, the main contributions of the thesis are presented from a high level of abstraction. The approach described here has then been implemented in Java and evaluated.

Two different models are required to meet the goals of this thesis:

- A model to enable the creation of programs for Stream Analytics graphically, in other words, to translate items specified in a GUI into runnable source code automatically.
- A model to continuously assess the end-user flow composition for semantic validity and provide feedback about it.

While presenting these two models below, the design choices made are highlighted.

4.1. Translation and Code Generation

The aim of this model is to provide a way in which a graphical flow defined by the end user of the mashup tool (via its GUI) can be translated into source code to program Flink. Fig. 4.1 gives a high-level overview of what the conceptual approach to achieve such a purpose looks like. This model behaves as follows:

1. First of all, end users define graphical flows in the GUI of the mashup tool. They do so by connecting a set of Visual Components in a flow-like structure.

Visual Component
A Visual Component can be used by the end user to create a mashup. It represents a certain Flink functionality and has a set of properties that the user may configure according to their needs.

2. Then, a Translator gets the aggregated information of the user-defined flow, which contains:

- The set of Visual Components that compose the flow.

- The way in which they are connected.
- The properties that the user has configured in each of them.

Translator
The Translator has three basic components: a Graphical Parser, an Actor System and a Code Generator. It takes as input the aggregated information of the user-defined graphical flow (Visual Components + flow structure + user-defined properties) and its output is a packaged and runnable Flink job.

a) The Graphical Parser takes the aggregated information described in the previous point and processes it, to create an internal model and instantiate the set of actors that correspond to the flow.

b) The Actor System is the execution environment of actors, which contains the business logic of the translator. Actors are taken from the output of the Graphical Parser. The actor model abstraction makes each actor independent, and the only way to interact with the rest is by means of exchanging messages [57].

Actors communicate using a data structure that has been explicitly defined for making the translation, with a tree-like structure that makes appending new nodes extremely easy. In this model, this data structure will be referred to as STDS (Specific Tree-Like Data Structure). A tree is a particular case of a Directed Acyclic Graph (DAG) [58], so a DAG would work as well. The tree has been chosen because of its simplicity and because it is widely used in code generation, for instance in an Abstract Syntax Tree (AST) [59].

Design Choice
The messages that actors exchange contain a data structure that follows a tree-like scheme. This helps in generating the code in a later step.

As stated above, each actor stands for a specific Flink functionality. In more detail, this means that each actor knows the generic structure of the code statements that are required to make use of that functionality. Rather than keeping it hard-coded, this generic structure is kept in a parameterized way, and the mapping is handled by the Code Generator.
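As an illustration only, the STDS could be realized as a simple node type holding a parameterized statement, the user-defined properties and a list of children; the names in the following sketch are hypothetical and do not correspond to the actual implementation presented in Chapter 5:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of one node of the tree-like data structure (STDS)
// exchanged by the actors: each node carries a parameterized code statement
// and the user-defined properties needed to instantiate it later.
public class StdsNode {
    private final String parameterizedStatement;   // e.g. "$stream.filter($function)"
    private final Map<String, String> properties;  // user-defined values for the parameters
    private final List<StdsNode> children = new ArrayList<>();

    public StdsNode(String parameterizedStatement, Map<String, String> properties) {
        this.parameterizedStatement = parameterizedStatement;
        this.properties = properties;
    }

    // Appending a new node is trivial, which is the main reason a tree was chosen.
    public StdsNode appendChild(StdsNode child) {
        children.add(child);
        return child;
    }

    public String getParameterizedStatement() { return parameterizedStatement; }
    public Map<String, String> getProperties() { return properties; }
    public List<StdsNode> getChildren() { return children; }
}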

Actors and Actor System
A set of actors that stand for a specific Flink functionality are executed in the Actor System. They know about a parameterized, generic structure of the Flink statements that map that functionality, and exchange messages with an STDS, to which they append the generic structure plus the properties that the user defined.

c) Finally, the Code Generator takes the STDS as input. It has an internal mapping that lets it translate parameterized statements into real Flink source code statements. This entity combines the parameterized statements with this mapping and the user-defined properties and generates the final source code. The compiling process also takes place here. The output of the Code Generator is a packaged, runnable Flink job that can be deployed on any instance of Flink.

Code Generator
The Code Generator takes an STDS as input. It processes it and combines it with a mapping of the actual Flink API and the user-defined properties, and generates, compiles and packages the result to create a runnable Flink job.

4.2. Validation of Graphical Flows

The approach presented in Section 4.1 allows the translation of graphical flows into source code. However, some graphical flows may result in source code that either cannot be compiled or contains errors. To prevent this, end users should get some visual feedback while they are creating a flow in the GUI.

For this purpose, a simple yet powerful semantics language has been defined. This language is based on conditions that apply to any two GUI components, enforcing the order in which they should appear. A structure like the following one is suggested:

Structure of a Semantics Condition

$$\underbrace{\text{Visual Component A}}_{\text{main visual component}} \; \underbrace{\text{should}\,/\,\text{must}}_{\texttt{isMandatory}} \; \text{come} \; \underbrace{(\text{immediately})}_{\texttt{isConsecutive}} \; \underbrace{\text{before}\,/\,\text{after}}_{\texttt{isPrecedent}} \; \underbrace{\text{Visual Component B}}_{\text{argument visual component}}.$$

Figure 4.1.: Conceptual Approach for Translation and Code Generation (the graphical flow defined by the user in the GUI is processed by the Translator, i.e. the Graphical Parser, the Actor System exchanging the STDS, and the Code Generator, to produce a runnable Flink program)

As can be seen, each condition is composed of two visual components and a set of flags:

- Visual Component A is the main visual component in the condition.
- isMandatory stands for the following:
  - If set to true: the condition must be met, e.g. "Visual Component A MUST (...) Visual Component B".
  - If set to false: the condition should be met, but there is no obligation, e.g. "Visual Component A SHOULD (...) Visual Component B". In other words, if Visual Component B exists, then it should come (immediately)

after/before Visual Component A. The condition is also satisfied if Visual Component B does not exist.
- The isPrecedent flag indicates which node should come before the other one:
  - If set to true: the condition is one of precedence, e.g. "Visual Component A (...) AFTER Visual Component B".
  - If set to false: the condition is not one of precedence, e.g. "Visual Component A (...) BEFORE Visual Component B".
  This flag can be confusing, but it is straightforward if it is considered as an answer to the question of whether or not Visual Component B is precedent.
- The isConsecutive flag indicates whether or not the contiguity defined in the condition is strict:
  - If set to true: contiguity is strict, e.g. "Visual Component A (...) IMMEDIATELY (...) Visual Component B".
  - If set to false: contiguity is relaxed, e.g. "Visual Component A (...) Visual Component B".
- Visual Component B is the visual component that is given as a parameter when specifying the condition.

These conditions are essentially a many-to-many mapping among visual components, with some extra information (the flags) added to specify details about the relationships.

Design Choice
To ensure semantic validity of the end-user flow in a continuous way, the validation of a graphical flow is performed whenever a new visual element is connected to it. This is the concept of immediacy that will be considered.
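Conceptually, each condition can be captured by a very small data structure. The following sketch (with hypothetical names, not the actual aflux implementation) shows the two visual components and the three flags, together with the textual form of the condition:

// Hypothetical representation of one condition of the validation language.
public class SemanticsCondition {
    private final String mainComponent;     // "Visual Component A"
    private final String argumentComponent; // "Visual Component B"
    private final boolean isMandatory;      // must (true) vs. should (false)
    private final boolean isPrecedent;      // B is precedent, i.e. A comes after B (true) vs. before B (false)
    private final boolean isConsecutive;    // strict (true) vs. relaxed (false) contiguity

    public SemanticsCondition(String mainComponent, String argumentComponent,
                              boolean isMandatory, boolean isPrecedent, boolean isConsecutive) {
        this.mainComponent = mainComponent;
        this.argumentComponent = argumentComponent;
        this.isMandatory = isMandatory;
        this.isPrecedent = isPrecedent;
        this.isConsecutive = isConsecutive;
    }

    // Human-readable form, e.g. "GPS Filter MUST come immediately after SmartSntndr Data".
    @Override
    public String toString() {
        return mainComponent
                + (isMandatory ? " MUST" : " SHOULD")
                + " come"
                + (isConsecutive ? " immediately" : "")
                + (isPrecedent ? " after " : " before ")
                + argumentComponent;
    }
}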


5. Implementation

In this chapter, technical details are given about the artifacts that have been implemented as part of this thesis. The focus is on the problems that were faced and the ways in which they were solved.

5.1. SmartSantander Connector for Flink

The main structure of a Flink program was depicted in Fig. 2.4. As can be seen, all Flink programs begin with the definition of a data source. Flink comes with a set of built-in data sources that can be used right out of the box [60]:

- File-based data sources: they monitor a specific file or directory and read data from files in it.
- Socket-based data sources: they allow specifying a hostname and a port to connect to and read data from. A delimiter character is used to identify consecutive messages.
- Collection-based data sources: they read data from a Java java.util.Collection or from an Iterator.

These built-in data sources are quite simple, and clearly not enough for many use cases. For this reason, Flink offers the possibility to extend the RichSourceFunction class and create more complex data sources. In the terminology used in Flink, a connector is an artifact that provides custom data sources (and sinks) by inheriting from this class. Connectors are not part of the Flink core, but instead are external entities that are maintained separately. For instance, compatibility with Kafka, both for reading data in and writing it out, is supported by means of a connector (FlinkKafkaConsumer). Connectors do not only allow the integration of large, generic platforms like Kafka; they also support services like Twitter (TwitterSource) and Wikipedia (WikipediaSource). In the end, all connectors make use of one of the three basic data sources stated above, but they create an abstraction layer that proves to be extremely helpful when using them in Flink jobs.

Therefore, the main objective of this task is to create a Flink connector that enables creating a data stream out of real-time data from the SmartSantander API.
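For reference, the three built-in source types listed above can be used as follows; this is a minimal, self-contained sketch based on the standard DataStream API, where the file path, host and port are placeholder values:

import java.util.Arrays;

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BuiltInSourcesExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // File-based source: read a text file line by line.
        DataStream<String> fromFile = env.readTextFile("/tmp/input.txt");

        // Socket-based source: read newline-delimited messages from host:port.
        DataStream<String> fromSocket = env.socketTextStream("localhost", 9999);

        // Collection-based source: turn a Java collection into a stream.
        DataStream<Integer> fromCollection = env.fromCollection(Arrays.asList(1, 2, 3));

        fromFile.print();
        fromSocket.print();
        fromCollection.print();
        env.execute("Built-in sources example");
    }
}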

The SmartSantander API

The sensors deployed throughout the city of Santander post their measurements in real time to a back-end. The City Hall of Santander provides a dashboard that shows those data from the back-end on a map of the city, as depicted in Fig. 5.1.

Figure 5.1.: Live Data Provided by the SmartSantander API

Live data from the sensors deployed in Santander can be accessed from two main entry points in the back-end:

- As the city collaborates in the FIWARE project, data from some of the sensors can be accessed through the FIWARE Context Broker global instance, deployed in FIWARE Lab. The FIWARE Context Broker (also known as Orion) is one of the building blocks of the FIWARE platform [61]. It offers publish-subscribe features to let external applications monitor context, as well as a wide range of filtering criteria and many more advanced, fine-grained functionalities [62] that would make its integration very appealing. However, for some reason, the data posted to Orion has stopped being updated. This is the reason why, despite the flexibility that its API provides, there was finally no option but to discard accessing the data through Orion.
- The second entry point is the data catalog that the City Hall of Santander provides, which can be accessed via a REST API, as shown in Table 5.1 [63]. Even though the flexibility is more limited here, data are streamed in real time with just a few minutes of delay, so this is the entry point that will be used.

In order to suit the use case of this thesis, three data collections have been integrated when developing the Flink connector: traffic (with the identifier datos-trafico),

Table 5.1.: Main Endpoints of the SmartSantander REST API

URL: es/api/3/action/package_search | Description: List of all available data collections.
URL: es/api/3/action/package_show?id=[collectionid] | Description: Details about a specific data collection, indicated in [collectionid]. The metadata contains a list of the associated resources, i.e. datasets.
URL: rest/collections/[datasetid] | Description: Metadata about a specific dataset resource, indicated in the [datasetid] parameter. The description of the attributes of the dataset is provided in this endpoint.
URL: api/rest/datasets/[datasetid]?query=[querysentence]&sort=[asc/desc] | Description: Live data of a specific dataset, indicated in [datasetid]. Data can be filtered using Lucene syntax [64] in the [querysentence] option. The sorting order can be chosen with the sort option.

Unless specified otherwise, data will be provided in JSON format.

environment (with the identifier sensores-ambientales) and air quality (with the identifier sensores-moviles). All of them have just one resource, which provides the real-time data. Only in the case of the traffic collection is there a second resource, which provides the locations of the traffic sensors. The refresh delay depends on the API, and it ranges from a couple of seconds in the traffic dataset to several minutes in the environment dataset.

Tables A.1 to A.3 show in detail the attributes of the datasets. In short, all the datasets have a set of common attributes that can be used to treat them all in the same way. These attributes are:

- The identifier of the sensor.
- Its latitude and longitude.
- The timestamp of the measurement.

Apart from these, other specific attributes indicate the measurement data itself, e.g. traffic charge in the traffic dataset, temperature in the environment dataset and NO2 in the air quality dataset.

The Data Model

By leveraging the object-oriented features of Java, a different Plain Old Java Object (POJO) class will be created for each type of observation: TrafficObservation for measurements belonging to the traffic dataset, EnvironmentObservation for measurements belonging to the environment dataset and AirQualityObservation for measurements belonging to the air quality dataset. The attributes of each class are in turn the attributes available in the dataset.

It is important to highlight that the identifier of the sensor has a different name depending on the dataset: it can be either ayto:idSensor or dc:identifier. In the same way, the latitude and longitude attributes in the traffic dataset are not directly available, but need to be taken from the other resource in the traffic data collection. The connector will have to handle all this when retrieving the data from the SmartSantander API.

Finally, to accomplish an approach that is as generic as possible, an interface named SmartSantanderObservation containing the getter methods for the common attributes will be created. This way, all measurements can be treated in the same manner, regardless of their type. Fig. A.1 depicts the Unified Modeling Language (UML) diagram of this whole model.

The Business Logic

Most of the business logic of the connector takes place in the component named SmartSantanderObservationStream. The following summarizes the process of retrieving the latest data that this component performs:

1. A request is made to the SmartSantander endpoint of the live-data API. This endpoint is specified when instantiating the SmartSantanderObservationStream and is given by an Enum type named SmartSantanderApiEndpoints. The Apache HyperText Transfer Protocol (HTTP) Client for Java [65] is used to make the requests. The response is parsed as a String, which contains the latest data represented as a list of JSON objects. A sample response is shown in Listing A.1.
2. The response String is then parsed with the Gson library [66] to deserialize the JSON objects into POJOs.
3. Each of the POJOs is then appended to an instance of BlockingQueue, which ensures consistency in concurrent contexts [67].

This process is repeated periodically at a rate that the user can configure via the updateFrequency field. The methods connect() and close() can be employed to run and stop the execution of these tasks in the background, using a different Thread for execution.
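The common attributes translate naturally into the SmartSantanderObservation interface mentioned above. The following sketch illustrates its likely shape; the exact method names are an assumption, as the reference definition is the UML diagram in Fig. A.1:

// Common getters shared by TrafficObservation, EnvironmentObservation and
// AirQualityObservation, so that the connector can treat any measurement uniformly.
public interface SmartSantanderObservation {
    int getSensorId();        // identifier of the sensor
    double getLatitude();     // latitude of the sensor, in decimal degrees
    double getLongitude();    // longitude of the sensor, in decimal degrees
    String getTimestamp();    // timestamp of the measurement
}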

Regarding the implementation of SmartSantanderObservationStream, there are two important considerations that must be further detailed: the way in which duplicates are avoided and how the deserialization is performed.

Avoiding Duplicates

The SmartSantander API offers no guarantee at all in terms of consistency of the data it provides. This is to say that, when a request to the API is made, the response can contain either resources that have already been stored or objects whose timestamps have already expired because more recent data has been processed. To keep track of all of this and avoid repeated and old values, a register is kept in a Java Map structure, whose key is the sensor identifier and whose value is the timestamp of the latest resource that has been processed. Whenever a new resource from the API is to be stored in the BlockingQueue, this Map is checked first, and the resource is discarded if the stored timestamp for that sensor is equal to or more recent than the new one.

Deserializing JSON Resources

When it comes to creating a POJO out of a JSON resource, a mapping between the fields in this resource and the fields in the corresponding Java class is required. Gson eases this task by providing a set of annotations that can be used in the definition of the Java class. This is used to map AirQualityObservation and EnvironmentObservation. An example is shown in Listing 5.1.

Listing 5.1: Deserialization of JSON Resources with Annotations

public class AirQualityObservation implements SmartSantanderObservation {

    @SerializedName("dc:identifier")
    private int sensorId;

    // rest of the code of the class
}

Although this approach is extremely simple and straightforward, it requires that the class has a constructor with no parameters; the Gson engine then assigns the values of its attributes. The TrafficObservation is a special case, because it requires combining data from two APIs: the live-data API itself and the location API. The latter provides the coordinates of all traffic sensors in the city in shapefile (SHP) format [68]. This file format can be easily converted to a Comma Separated Values (CSV) file by considering its spatial reference system, which in this case is ED_1950_UTM_Zone_30N. This way, whenever the class TrafficObservation is instantiated from a resource of the live-data API, its sensor identifier is searched in the locations data and its latitude and longitude are added to the instance.

One of the ways to achieve this is to perform all the required business logic in the class constructor, as shown in Listing 5.2.

Listing 5.2: Constructor of TrafficObservation.java

public TrafficObservation(int sensorId, String timestamp, int occupation, int intensity, int charge) {
    this.sensorId = sensorId;
    this.timestamp = timestamp;
    this.occupation = occupation;
    this.intensity = intensity;
    this.charge = charge;

    double[] coordinates = findCoordinates(sensorId);
    this.latitude = coordinates[0];
    this.longitude = coordinates[1];
}

The findCoordinates method searches the locations data by using the Apache CSV Utils library [69]. To override the default behavior of Gson deserialization, and hence be able to use a constructor with parameters, a new JsonDeserializer can be defined (see Listing 5.3). The deserializer must then be registered in the Gson engine (Listing 5.4).

Listing 5.3: TrafficObservationDeserializer

private class TrafficObservationDeserializer implements JsonDeserializer<TrafficObservation> {

    @Override
    public TrafficObservation deserialize(JsonElement jsonElement, Type type, JsonDeserializationContext jsonDeserializationContext) throws JsonParseException {
        JsonObject jsonObject = jsonElement.getAsJsonObject();

        return new TrafficObservation(
            jsonObject.get("ayto:idSensor").getAsInt(),
            jsonObject.get("dc:modified").getAsString(),
            jsonObject.get("ayto:ocupacion").getAsInt(),
            jsonObject.get("ayto:intensidad").getAsInt(),
            jsonObject.get("ayto:carga").getAsInt()
        );
    }
}

Listing 5.4: Registering the Traffic Deserializer in Gson

GsonBuilder gsonBuilder = new GsonBuilder();
gsonBuilder.registerTypeAdapter(
    TrafficObservation.class,
    new TrafficObservationDeserializer());
Gson gson = gsonBuilder.create();

Java Generics and Type Erasure

In order to make the approach as generic as possible, Java generics have been used in the implementation of SmartSantanderObservationStream. This way, the same code is reused regardless of the type of SmartSantanderObservation. Listing 5.5 shows an example.

Listing 5.5: Instantiation of SmartSantanderObservationStream

Class<TrafficObservation[]> trafficObservationsType = TrafficObservation[].class;
SmartSantanderAPIEndpoints trafficEndpoint = SmartSantanderAPIEndpoints.TRAFFIC;
int updateFrequency = 4;
SmartSantanderObservationStream<TrafficObservation> trafficStream = new SmartSantanderObservationStream<>(trafficObservationsType, trafficEndpoint, updateFrequency);

Class<EnvironmentObservation[]> environmentObservationsType = EnvironmentObservation[].class;
SmartSantanderAPIEndpoints environmentEndpoint = SmartSantanderAPIEndpoints.ENVIRONMENT;
SmartSantanderObservationStream<EnvironmentObservation> environmentStream = new SmartSantanderObservationStream<>(environmentObservationsType, environmentEndpoint, updateFrequency);

As can be noted in Listing 5.5, one of the parameters that must be given to the constructor of SmartSantanderObservationStream is the Class instance of the type parameter (an array of it, in this case). When performing the deserialization, Gson needs an instance of the class of an array of the associated POJOs. However, because of the way Java handles Type Erasure [70], this cannot be coded using generic types, because it is not possible to retrieve the runtime type of generic type parameters. A common practice is in fact to include the Class instance as a parameter in the constructor [71].

The Flink Data Source

The entry point of Flink connectors to the Flink engine is the SourceFunction class. Any class that extends from it can be used as a data source when creating Flink jobs. In this SmartSantander connector, the SmartSantanderSource class serves as a wrapper of the business logic specified in SmartSantanderObservationStream and binds it to Flink. As shown in Listing 5.6, the overridden run method of SourceFunction is in charge of instantiating the SmartSantanderObservationStream and periodically polling its BlockingQueue to retrieve the latest observations.

Listing 5.6: SmartSantanderSource

public class SmartSantanderSource<T extends SmartSantanderObservation> extends RichSourceFunction<T> {

    // rest of the class

    @Override
    public void run(SourceContext<T> ctx) throws Exception {

        try (SmartSantanderObservationStream<T> stream = new SmartSantanderObservationStream<>(observationsArrayClass, endpoint, updateFrequency)) {
            // Open connection
            stream.connect();

            while (isRunning) {
                // Query for the next observation event
                T event = stream.getObservations().poll(1, TimeUnit.SECONDS);

                if (event != null) {
                    ctx.collect(event);
                }
            }
        }
    }
}

The UML diagram of the classes in charge of the business logic and the binding to Flink is depicted in Fig. A.2.

5.2. Mashup Components for aflux

As explained in Section 2.4, aflux supports plug-ins to expand its functionality. The main objective of this implementation task is hence to build a plug-in that contains a set of mashup components that enable the creation of Flink programs.

Plug-ins for aflux should have the aflux-tool-base artifact as a dependency. Each of the mashup components in a plug-in is typically implemented with two classes:

- A class inheriting from AbstractMainExecutor (full class name: de.tum.in.aflux.tools.core.AbstractMainExecutor), which contains all the information related to the mashup component in terms of end-user visualization, i.e. its name, color, and a reference to the class that contains the associated AbstractAFluxActor. Whenever a new plug-in is loaded into aflux, all its AbstractMainExecutors are loaded and offered to the user for the creation of new flows.
- A class inheriting from AbstractAFluxActor (full class name: de.tum.in.aflux.tools.core.AbstractAFluxActor), which is instantiated and added to the actor system whenever the mashup component is used in a flow. This is where all the computations and the business logic related to the component take place. AbstractAFluxActor is just an Akka Actor that includes further functionalities specific to aflux, e.g. whether it is capable of handling asynchronous operations.

The main task of actors in this plug-in is to translate the user's preferences so as to generate the corresponding Java code.

Java Code Generation

The main purpose of the business logic behind each of the mashup components is to translate the user's preferences, which they have specified using the GUI of aflux, into the corresponding Java code that makes use of the Flink API. Therefore, there must be a mechanism to generate Java code. The JavaPoet library has been chosen because of its ease of use, simplicity and performance. According to its documentation [72], JavaPoet understands types and allows assembling a file as a tree of declarations, rather than streaming its contents top-to-bottom in a single pass, like other outdated approaches [73].

The most relevant component of JavaPoet for the purpose of this thesis is the ClassName type. When generating a statement, a ClassName instance can be provided to let JavaPoet know the type it refers to. JavaPoet handles the imports of the required ClassNames automatically. The TypeName super class supports both the assignment of ClassNames and parameterized types. This is an important point, because a data stream in Flink always has a type parameter which indicates the class of objects it is composed of.

Apart from these, there are also classes to generate new Java entities, not just to refer to existing ones like ClassName and TypeName. With the MethodSpec and TypeSpec types, new methods and types can be generated, respectively. Listing A.2 shows an example of how to generate the anonymous class shown in Listing A.3.

Message Passing Among Actors

As explained in Section 2.4, the actors in an actor system communicate just by exchanging messages. For this reason, it is necessary to define the format of the messages that the actors in this Flink plug-in will be working with. This message should include all the necessary information for an actor to make its computations properly.

The messages allowed in the Akka actor model can be any Java Object, so any POJO can be used to define the message format. For this Flink plug-in, the message is defined by the FlinkFlowMessage class, which is also part of the plug-in. The message is composed of three main fields:

- A CodeBlock.Builder, which is part of the JavaPoet library and stores all the information required to generate the final code. This is the most important field, because any actor of the plug-in can append code statements by invoking its addStatement() method. This entity allows writing but not reading, yet some information about the previously added statements is required for subsequent actors to properly add theirs, so two more parameters are required.
- A TypeName that refers to the last variable type assigned by previous actors. This is crucial when it comes to chaining consecutive operations on data streams. Rather than being independent from each other, they usually take as input the output of

the previous one, if there is any. So, actors must share this information when they communicate with each other.
- A String containing the last data stream variable name that has been assigned, as not only the TypeName of the last variable is required to enable chaining, but also its variable name. Random variable names are created whenever an actor requires one (to avoid duplicates and hard-coded names) with the Apache Commons Lang library [74].

Finally, to support the definition of patterns for CEP without losing track of the data streams, a fourth attribute is included: a String containing the last pattern variable name that has been assigned.

When an actor of the plug-in is invoked by the actor system because a previous one sent a message to it, it will try to cast the message it receives to a FlinkFlowMessage and throw an error otherwise.

Structure of the Actors

All the implemented actors perform the translation from the user's preferences into Flink code similarly, following this structure:

1. The message from the previous actor in the actor system (if any) is validated and cast to a FlinkFlowMessage. The CodeBlock.Builder instance is then retrieved from it.
2. The properties that the user has configured in the GUI are retrieved using the getProperty() method of AbstractAFluxActor.
3. The types that are necessary for generating the code are described using JavaPoet. This includes the definition of anonymous types (e.g. for transformations, see the transformation components below for more details) as well as the creation of references to any existing type (e.g. creating a ClassName object to refer to the Flink DataStream class).
4. All the parameters and types defined in the previous step are added to a Map<String, Object>.
5. The Map is passed as a parameter to the CodeBlock.Builder, and the required code statements are appended to it.
6. Finally, the new FlinkFlowMessage is built and passed on to the next actor in the actor system (if any).

Of course, this is just a generic structure, and some actors do not follow exactly this same behavior. In the following sections, the behavior and the translation that take place in each of the implemented mashup components are presented. The most relevant information about them is summarized in Table A.4. In total, twelve mashup components have been implemented:

- Begin Job
- SmartSantander Data
- GPS Filter
- Select
- Window
- Window Operation
- Output Result
- CEP Begin
- CEP Add Condition
- CEP New Pattern
- CEP End
- End Job

Setting Up the Flink Environment

The first step in every Flink job is to obtain an execution environment [75]. This involves both retrieving an instance of it and making any necessary configurations. The "Begin Job" component (see Table A.4) is in charge of all of that.

Main Executor: Node01EnvironmentSetUp. Regarding the GUI part of the component, it has no input, so that the user realizes that no other mashup component should be placed before it in the flow.

aflux Actor: EnvironmentSetUpActor. This component could include as many configuration parameters as required. For the purpose of this thesis, just the TimeCharacteristic of streams is considered. It is set to EventTime by default, to meet the SmartSantander use case.

SmartSantander Data Source

Apart from configuration, defining the source of the data is the first element that should appear in a Flink program (see Fig. 2.4). In this thesis, just data from the SmartSantander API are considered as possible data sources. The "SmartSntndr Data" component (see Table A.4) allows this by means of the SmartSantander connector that has been detailed in Section 5.1.

Main Executor: Node02SmartSantanderDataSource. The properties of this component that the user can configure are shown in Table 5.2. The set of observation types that are offered to the user to choose from is computed at run-time (see Section 5.3 for more details).

aflux Actor: SmartSantanderDataSourceActor. This component creates a SmartSantanderSource with the parameters that the user specified and a DataStream that makes use of it. Besides, to enable the notion of Event Time when Flink processes the streams [30], timestamps and watermarks need to be added to the data. This is done in the way described in [76], by using an assigner with ascending timestamps. In summary, an example of the code that could be generated by this mashup component is shown in Listing 5.7.

Table 5.2.: Properties of the "SmartSntndr Data" Mashup Component

Property: Observation Type | Type: SELECT | Details: Choose among: Traffic, Environment, AirQuality | Initial Value: Traffic
Property: Update Time | Type: TEXT | Details: Time interval to check the SmartSantander API. It should be an integer. | Initial Value: 4

Listing 5.7: Sample Code Generated by the "SmartSntndr Data" Component

SmartSantanderSource<EnvironmentObservation> environmentSource = new SmartSantanderSource<>(
    EnvironmentObservation[].class,
    SmartSantanderAPIEndpoints.ENVIRONMENT,
    10
);
DataStream<EnvironmentObservation> environmentObservations = see.addSource(environmentSource, TypeInformation.of(EnvironmentObservation.class))
    .assignTimestampsAndWatermarks(new AscendingTimestampExtractor<EnvironmentObservation>() {
        @Override
        public long extractAscendingTimestamp(EnvironmentObservation environmentObservation) {
            return Instant.parse(environmentObservation.getTimestamp()).toEpochMilli();
        }
    });

Transformation Mashup Components

Four of the available mashup components let the user specify transformations on the initial data, which have been retrieved from the SmartSantander API using the mashup component described above. Flink transformations operate on a specific data stream (or a set of them) and convert it into another one [77]. Flink programs can combine multiple transformations to create complex topologies and stream processors; it is just up to the user. Transformations are also the cornerstone of Flink programs, as depicted in Fig. 2.4.

Keeping in mind the SmartSantander use case, this aflux plug-in supports the following transformations: Filter, Map, WindowAll, and both built-in and custom Aggregations on windows. A list of all the transformations that are currently available in Flink can be obtained in [78].
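To illustrate how the supported transformations compose before each of them is discussed individually, the following hand-written fragment (not code generated by the plug-in; it assumes the environmentObservations stream from Listing 5.7 and the custom AverageAggregate function introduced later) chains a Map, a WindowAll and an Aggregation:

// Hand-written sketch: extract a field, window it and aggregate per window.
DataStream<Double> averageTemperature = environmentObservations
        .map(new MapFunction<EnvironmentObservation, Double>() {      // Map: extract one field
            @Override
            public Double map(EnvironmentObservation observation) throws Exception {
                return observation.getTemperature();
            }
        })
        .windowAll(TumblingEventTimeWindows.of(Time.minutes(5)))      // WindowAll: 5-minute tumbling windows
        .aggregate(new AverageAggregate());                            // Aggregation applied to each window

A Filter transformation could be inserted before the map in exactly the same chained style.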

Filtering Transformations

The Filter transformation in Flink evaluates a boolean function for each element of the data stream and retains only those which meet that condition [78]. This functionality can be leveraged to let the user specify a certain location and filter out data that comes from any sensor that is too far from that place. The degree of distance is defined by a radius: sensors whose location falls within the circle are accepted, while the data from the rest of them are rejected.

Main Executor: Node03TransformationFilter. The latitude, longitude and radius that are used to perform the filtering transformation are configured from a single property, as shown in Table 5.3. This property has a special type that has been implemented as part of this thesis, for the sake of user-friendliness: LOCATION_PICKER. This property type has been defined in the aflux engine (de.tum.in.aflux.tools.core.PropertyInputType) and is rendered by the front-end as can be seen in Fig. 5.2:

- First, three text fields are created. The user may enter here the numeric values for the latitude, longitude and radius they desire. Latitude and longitude must be specified in decimal degrees.
- Second, a Google Maps widget is created to let the user fill in the latitude and longitude fields just by drag-and-dropping a marker (pushpin). The React Location Picker [79] made it possible to integrate the Google Maps JavaScript API into the React-based front-end of aflux. (For the widget to work, a Google Maps API key needs to be entered in the index.html file located under aflux/src/main/java/webapp; instructions for obtaining a key are available in the Google Maps documentation.) When the marker is drag-and-dropped, the text fields are automatically updated to give the user feedback about the location they just chose. In the same way, the widget refreshes automatically whenever a new value of latitude, longitude or radius is entered in the text fields.

The desired filtering location (latitude, longitude and radius) is serialized as a String, following the "[latitude],[longitude],[radius]" format.

aflux Actor: TransformationFilterActor. The algorithm used to compute the distance between coordinates uses the Haversine formula as a basis [80], as shown in [81]. This actor adds the code of the algorithm inside the Flink filtering function. An anonymous class is required to describe this function, and it is passed as a parameter to the filter method of any data stream. In summary, an example of the code that could be generated by this mashup component is shown in Listing 5.8.

Table 5.3.: Properties of the "GPS Filter" Mashup Component

Property: Filtering location | Type: LOCATION_PICKER | Details: Contains three text fields, for latitude, longitude and radius, plus a draggable Google Maps widget for a more user-friendly integration. | Initial Value: Santander city center with a radius of 1 km (radius = 1000)

Figure 5.2.: The Location Picker Property

Listing 5.8: Sample Code Generated by the "GPS Filter" Component

DataStream<EnvironmentObservation> filteredEnvironment = environmentObservations.filter(new FilterFunction<EnvironmentObservation>() {
    @Override
    public boolean filter(EnvironmentObservation input) throws Exception {
        final double EARTH_RADIUS = 6371;
        double refLat = ;
        double refLng = ;
        double radius = 0.7;
        double currentLat = input.getLatitude();
        double currentLng = input.getLongitude();
        double dLat = Math.toRadians(refLat - currentLat);
        double dLng = Math.toRadians(refLng - currentLng);
        double sindLat = Math.sin(dLat / 2);

        double sindLng = Math.sin(dLng / 2);
        double va1 = Math.pow(sindLat, 2) + Math.pow(sindLng, 2) * Math.cos(Math.toRadians(currentLat)) * Math.cos(Math.toRadians(refLat));
        double va2 = 2 * Math.atan2(Math.sqrt(va1), Math.sqrt(1 - va1));
        double distance = EARTH_RADIUS * va2;
        return distance < radius;
    }
});

Map Transformations

The Map transformation in Flink produces a new element out of each of the elements of the data stream [78]. This functionality can be leveraged to let the user specify a field of the type of SmartSantanderObservation that they chose in the "SmartSntndr Data" component and create a new data stream that contains just the values of this field, to enable further processing.

Main Executor: Node04TransformationMap. Just one property is needed in this mashup component: a field that allows the selection of the attribute to map. Table 5.4 shows the details.

Table 5.4.: Properties of the "Select" Mashup Component

Property: Attribute to Map | Type: SELECT | Details: Choose among: Traffic Occupation, Traffic Charge, Traffic Intensity, Noise Level, Temperature, Light Intensity, Level of NO2, Level of CO, Level of Ozone | Initial Value: Traffic Occupation

aflux Actor: TransformationMapActor. An anonymous class needs to be created to describe the mapping function, which invokes the appropriate getter method to create the new data stream. This class is then passed as a parameter to the map method of any data stream. In summary, an example of the code that could be generated by this mashup component is shown in Listing 5.9.

Listing 5.9: Sample Code Generated by the "Select" Component

DataStream<Double> temperature = environmentObservations.map(new MapFunction<EnvironmentObservation, Double>() {
    @Override
    public Double map(EnvironmentObservation environmentObservation) throws Exception {
        return (Double) environmentObservation.getTemperature();
    }
});

Window Transformations

Windows are key to processing infinite data flows, because they allow the creation of finite bundles of data events that can be processed together. Flink offers a wide range of configuration elements when windows are defined [31]:

- Keyed vs. Non-Keyed. Windows can be created either from a DataStream or a KeyedStream. The basic difference between the two is that a KeyedStream is made up of disjoint partitions of the data events, according to a specified key, and it allows what Flink calls keyed state [82]. This functionality is not needed in the context of this thesis, so windows will be created from plain data streams, not keyed ones.
- Window Assigner. This element is in charge of processing all the elements in the data stream and deciding whether or not each of them should be part of a specific window. Flink provides a set of predefined window assigners, which can be used to create different types of windows. Among all of them, tumbling and sliding windows are the most relevant for the use case of this thesis (apart from being the most important ones). In a tumbling approach, windows do not overlap, and each element is assigned to a maximum of one window. However, if overlapping windows are desired, then the sliding approach should be used.
- Trigger. This element makes the decision of whether or not the window is ready to be processed. As it is only used for advanced set-ups, just the default trigger, which fires on the progress of time, is supported in this aflux mashup component.
- Evictors. This is an optional parameter that can be used to remove elements from a certain window after it has been processed. Again, it is only used for advanced and very specific use cases, which do not match that of this thesis. Hence, evictors are not supported.
- Lateness and Side Outputs. Flink supports the fact that events may arrive late when event time is configured. However, in this use case this is already handled by the SmartSantander API, so there is no need to support it in this mashup component.

- Window Function. This is where the operation to be performed on the elements of a window is defined. More details are given in the "Window Operation" mashup component.

In the "Window" mashup component, all the above parameters may be defined by the user, except for the window function, which is specified in the "Window Operation" mashup component. From the user perspective, it makes sense to define the conditions to create windows in the first place, and the operations to be applied on them in the second place, as they are applied after the window is created. Table 5.5 summarizes which elements of windows are supported in this plug-in and which ones are not.

Table 5.5.: Supported Windows APIs

Element | Supported Items/APIs | Non-Supported Items/APIs
Keyed/Non-Keyed | Non-Keyed Windows | Keyed Windows
Window Assigners | Tumbling, Sliding | Global (requires custom trigger), Session
Trigger | Default Trigger | Custom Triggers
Evictors | Not supported | Not supported
Lateness | Not supported | Not supported
Built-in Functions | min, max, sum | minBy, maxBy
Custom Functions | Reduce, Aggregate | Fold, Apply

Main Executor: Node05TransformationWindow. Four properties are defined for this mashup component, as shown in Table 5.6.

aflux Actor: TransformationWindowActor. This mashup component must consider that creating a different type of window (tumbling/sliding) results in a different number of parameters being required when invoking the windowing function. In summary, an example of the code that could be generated by this mashup component is shown in Listing 5.10.

Listing 5.10: Sample Code Generated by the "Window" Component

// Tumbling windows with a size of 5 minutes
AllWindowedStream<Double, TimeWindow> tumblingWindowedTemperature = temperature.windowAll(TumblingEventTimeWindows.of(Time.minutes(5)));

Table 5.6.: Properties of the "Window" Mashup Component

Property: Window Type | Type: SELECT | Details: Choose among: Tumbling, Sliding | Initial Value: Tumbling
Property: Window Units | Type: SELECT | Details: Choose among: Hours, Minutes, Seconds | Initial Value: minutes
Property: Window Size | Type: TEXT | Details: Length of the window, in event time. It should be an integer. Its units are configured in "Window Units". | Initial Value: 5
Property: Window Slide | Type: TEXT | Details: Size of the slide, in event time. Just required for sliding windows. Its units are configured in "Window Units". | Initial Value: 1

// Sliding windows with a size of 5 minutes, overlapped in 1 minute
AllWindowedStream<Double, TimeWindow> slidingWindowedTemperature = temperature.windowAll(SlidingEventTimeWindows.of(Time.minutes(5), Time.minutes(1)));

Window Operations

A key element of windows in Flink is the operation that is applied to all the elements in them, as explained above. Aggregations are the most flexible type of window operations [31], and Flink comes with a set of built-in functions, like computing the maximum, minimum, and sum of the window elements. All of them are supported in the "Window Operation" aflux mashup component. Apart from them, custom functions can be defined by means of the AggregateFunction Flink API.

Main Executor: Node06TransformationWindowOperation. Two properties are defined for this mashup component, as shown in Table 5.7.

aflux Actor: TransformationWindowOperationActor. This mashup component must consider whether or not the specified aggregation function is a built-in or a custom function. If it is a built-in function, the method is invoked directly from the AllWindowedStream created by the "Window" component. Otherwise, an anonymous class needs to be defined.
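Such a custom function implements Flink's AggregateFunction interface, which is described together with Table 5.7 below. As an illustration, an average over Double values could be written as follows; this is a hand-written sketch, not necessarily the AverageAggregate contained in the Flink project template:

import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.java.tuple.Tuple2;

// Illustrative average aggregate: input type Double, accumulator (sum, count), output Double.
public class AverageAggregate implements AggregateFunction<Double, Tuple2<Double, Long>, Double> {

    @Override
    public Tuple2<Double, Long> createAccumulator() {
        return Tuple2.of(0.0, 0L);
    }

    @Override
    public Tuple2<Double, Long> add(Double value, Tuple2<Double, Long> accumulator) {
        // Add the incoming element to the running sum and increase the count.
        return Tuple2.of(accumulator.f0 + value, accumulator.f1 + 1);
    }

    @Override
    public Double getResult(Tuple2<Double, Long> accumulator) {
        return accumulator.f1 == 0 ? 0.0 : accumulator.f0 / accumulator.f1;
    }

    @Override
    public Tuple2<Double, Long> merge(Tuple2<Double, Long> a, Tuple2<Double, Long> b) {
        return Tuple2.of(a.f0 + b.f0, a.f1 + b.f1);
    }
}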

Table 5.7.: Properties of the "Window Operation" Mashup Component

Property: Window Operation | Type: SELECT | Details: Choose among: Sum, Min, Max, Aggregate | Initial Value: Aggregate
Property: Window Aggregate Operation | Type: SELECT | Details: Shows all available operations. Currently just Average. | Initial Value: Average

Custom functions must be defined manually in the Flink project template and implement the AggregateFunction interface from the Flink API. An AggregateFunction has three type parameters: an input type, an accumulator type to which input elements are added, and an output type [31]. An example of the code that could be generated by this mashup component is shown in Listing 5.11.

Listing 5.11: Sample Code Generated by the "Window Operation" Component

// Built-in aggregation function
DataStream<Double> maxTemperature = windowedTemperature.max();

// Custom aggregation function (average)
DataStream<Double> averageTemperature = windowedTemperature.aggregate(new AverageAggregate());

Outputting a Data Stream

In Flink, a data stream can be sent to a file, a socket, an external system, or simply be printed through the standard output. Data sinks are usually how the business logic of a Flink program finishes (see Fig. 2.4). Just as with data sources, as explained in Section 5.1, Flink comes with a set of built-in data sinks that are only useful for debugging purposes or, at most, very basic scenarios:

77 5. Implementation In the rest of scenarios, Flink advises to use a connector. Again, some connectors are provided by the Flink team itself, and they range from advanced filesystem outputs to publishing the data to a Kafka topic. Considering the use case of this thesis, this plug-in supports: Standard out stream, which is useful for debug purposes. CSV, which allows further analysis of the data, e.g. create a graph out of it. Kafka topic, to make this aflux plug-in compatible with any other platform that uses Kafka. Main Executor: Node07OutputResult. Table 5.8 contains the details about the properties of the "Output Result" mashup component. One field is used to specify the desired output sink, while the others are just required if that specific sink was chosen. Table 5.8.: Properties of the "Output Result" Mashup Component Property Type Details Initial Value Output Type SELECT Choose among: Plain Text Kafka CSV Kafka Topic TEXT Kafka topic to publish the data to. Kafka Address TEXT URL in which Kafka is listening. Log identifier TEXT Identifier for the logs in Kafka. It will be prepended to each message. CSV Filename TEXT Name of the output CSV file. Its location depends on how aflux is run (usually it will be located in the Apache Tomcat folder) Kafka my-topic localhost:9092 output output.csv aflux Actor: OutputResultActor. This mashup component must consider whether or not the specified data sink is built-in right in the Flink core or, instead, is part of a connector. If it is a built-in function, the method is invoked directly from the DataStream. Sometimes, an anonymous class may need to be defined, to map the data stream into the type required by the sink. 50

An example of the code that could be generated by this mashup component is shown in Listing 5.12.

Listing 5.12: Sample Code Generated by the "Output Result" Component

// CSV Output (connector)
myDataStream.map(new MapFunction<Double, Tuple1<Double>>() {
    @Override
    public Tuple1<Double> map(Double aDouble) throws Exception {
        return Tuple1.of(aDouble);
    }
}).writeAsCsv("output.csv", FileSystem.WriteMode.OVERWRITE).setParallelism(1);

// Std-out Output
myDataStream.print();

Complex Event Processing

CEP is about real-time analytics of continuous data streams [83]. It can be used to detect event patterns in the data, which stand for situations of interest (such as threats or opportunities) to which a quick response is desired [84]. Flink supports CEP by means of a library called FlinkCEP [85]. This library allows the definition of patterns that may be as complex as the user desires, as well as of the way in which a pattern match is handled by the system. This Flink plug-in for aflux supports FlinkCEP to enable end users to program CEP graphically. Indeed, being able to detect patterns in the stream of events makes sense in IoT [83, 86], and even more in the context of a Smart City [87], such as Santander.

The API of FlinkCEP has two basic groups of functionalities:

Define Patterns. These are the event sequences that are to be searched for in the data stream. Flexibility is maximal: the full pattern can be described as a chain of individual patterns, which are in turn a set of conditions that a given event should match. A pattern matches when all its individual patterns match. Also, some conditions control how the sequence jumps from one individual pattern to the next one. Three mashup components are used for this purpose: "CEP Begin", "CEP Add Condition" and "CEP New Pattern".

Match Patterns. After the desired pattern has been described, it needs to be searched for in the target data stream. Besides, any time there is a match, some action must be triggered, and this should also be defined by the user. The "CEP End" mashup component is in charge of these functionalities.

Listing A.5 shows a basic example of how the FlinkCEP library should be used. The following entities can be identified:

A DataStream definition, in which the pattern will be searched. Of course, its type parameter should match that of the Pattern.

79 5. Implementation The pattern sequence is composed of three individual patterns, namely "start", "middle" and "end". Each of them is defined by two main parameters: A SimpleCondition that a given event should meet. The condition is given as an anonymous class parameter to the where function. More conditions could be added with the and and or functions. The degree of contiguity between the individual patterns, that is to say, the conditions for the sequence of events to jump from one individual pattern to the next one. Contiguity can be strict, if all matching events are expected to appear strictly one after the other, relaxed, if non-matching events appearing in-between the matching ones are ignored, or non-deterministic, relaxed, if additional matches that ignore some matching events are accepted [85]. Contiguity is indicated in the function that is used to define the name of the individual pattern. In the example, next() stands for strict contiguity and followedby stands for relaxed contiguity. But more options are available, as it can be seen in [85]. Once the pattern has been defined, it is matched to the desired data stream. Then, the results of the matchings are selected with a special class called Pattern SelectFunction. This takes one or more events of the matching pattern and generates some other object out of them. In the example, a user-defined Alert POJO is instantiated. This way, the stream of Alert can be treated as if it were any other Flink data stream, and the Flink program may continue. It is important to highlight that, apart from the conditions to jump from one individual pattern to the next one (degree of contiguity between patterns), the number of occurrences of each individual pattern can also be specified (degree of contiguity of a single, looping, individual pattern). This way, patterns like "events matching condition A 3 times, which are immediately followed by events matching condition B 2 or more times" can be defined. In FlinkCEP, the contiguity of an individual pattern should be defined after defining a certain condition. Furthermore, ending conditions can be specified, like within(time) to specify a time period in which a pattern should be met to consider it a matching, and until(condition) to specify when a looping pattern should end. Table 5.9 is a quick summary that will be later used for reference. To explain in detail how CEP and the FlinkCEP library work is outside of the scope of this document, and [85] can be consulted for further reference. The main idea here is to show that the library provides maximum flexibility for Flink programmers, but at the same time it is easy to get confused with concepts like pattern contiguity vs. individual pattern contiguity, several individual patterns in a pattern sequence vs. several conditions in an individual pattern, etc. This is the reason why enabling users to program CEP graphically is extremely interesting: because all this complexity is hidden to end users, while at the same time they may access virtually identical functionality. 52
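To make the description above concrete, a hedged sketch in the spirit of the referenced example follows. The TrafficObservation event type and its getCharge() accessor are the ones used later in this chapter, Alert stands for the user-defined POJO mentioned above, trafficDataStream is assumed to be an existing DataStream<TrafficObservation>, and the thresholds are purely illustrative.

import org.apache.flink.cep.CEP;
import org.apache.flink.cep.PatternSelectFunction;
import org.apache.flink.cep.PatternStream;
import org.apache.flink.cep.pattern.Pattern;
import org.apache.flink.cep.pattern.conditions.SimpleCondition;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.time.Time;
import java.util.List;
import java.util.Map;

Pattern<TrafficObservation, TrafficObservation> pattern =
    Pattern.<TrafficObservation>begin("start")
        .where(new SimpleCondition<TrafficObservation>() {
            @Override
            public boolean filter(TrafficObservation obs) {
                return obs.getCharge() >= 50;          // first situation of interest
            }
        })
        .next("middle")                                 // strict contiguity
        .where(new SimpleCondition<TrafficObservation>() {
            @Override
            public boolean filter(TrafficObservation obs) {
                return obs.getCharge() >= 60;
            }
        })
        .followedBy("end")                              // relaxed contiguity
        .where(new SimpleCondition<TrafficObservation>() {
            @Override
            public boolean filter(TrafficObservation obs) {
                return obs.getCharge() >= 75;
            }
        })
        .within(Time.minutes(10));                      // ending condition

PatternStream<TrafficObservation> matches = CEP.pattern(trafficDataStream, pattern);

DataStream<Alert> alerts = matches.select(
    new PatternSelectFunction<TrafficObservation, Alert>() {
        @Override
        public Alert select(Map<String, List<TrafficObservation>> match) {
            // build a notification out of the first event matching the "end" pattern
            return new Alert("Traffic charge rising: " + match.get("end").get(0));
        }
    });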

80 5.2. Mashup Components for aflux Table 5.9.: Defining Contiguity in FlinkCEP Between Patterns Inside a Pattern Degree of Contiguity Strict Relaxed next() notnext() followedby() notfollowedby() consecutive() nothing, this is the default Non-deterministic, relaxed followedbyany() allowcombinations() Quantifiers within(time) times(#times) times(#from, #to) timesormore(#times) optional() greedy() Begin Pattern Sequence The first step when using FlinkCEP is to define its Skip Strategy. The "CEP Begin" mashup component is in charge of this task. The skip strategy controls to how many matches an event will be assigned, with two options: NO_SKIP, in which every possible match will be emitted, or SKIP_PAST_LAST_EVENT, to discard every partial match that contains an event that was previously part of a matching pattern. Main Executor: Node08CepPatternBegin. As it can be seen in Table 5.10, just one property is available in this mashup component. Table 5.10.: Properties of the "CEP Begin" Mashup Component Property Type Details Initial Value Skip Strategy SELECT Choose among: NO_SKIP SKIP_PAST_LAST_EVENT NO_SKIP 53

81 5. Implementation aflux Actor: CepPatternSequenceBeginActor. The business logic of this mashup component is quite simple, as it only generates the code to configure the skip strategy. Listing 5.13 shows an example. Listing 5.13: Sample Code Generated by the "CEP Begin" Component 1 AfterMatchSkipStrategy strat = AfterMatchSkipStrategy.noSkip(); Add Individual Patterns Once the skip strategy has been defined, at least one individual pattern should be added. This is supported by the "CEP New Patt." mashup component. Main Executor: Node09CepPatternAppend. Table 5.11 shows all the properties that may be defined in this mashup component. As it can be seen, apart from the name of the pattern, contiguity may be defined in a very intuitive way. Table 5.11.: Properties of the "CEP New Patt." Mashup Component Property Type Details Initial Value Pattern Name TEXT Name of the pattern to be used when selecting its matching events. Contiguity SELECT Choose among: strict strict negative relaxed relaxed negative Min. Repetitions TEXT Minimum number of times that the pattern should be repeated. Max. Repetitions TEXT Maximum number of times that the pattern should be repeated. -1 stands for infinite, i.e. no maximum "mypattern" strict Optional CHECKBOX Could the pattern not be matched? FALSE Greedy CHECKBOX Is the pattern greedy? FALSE 1-1 Consecutive CHECKBOX Should repetitions be consecutive? (strict vs. relaxed contiguity) Within TEXT Time to match the pattern (in seconds). -1 stands for infinite, i.e. no time restrictions FALSE -1 54

aflux Actor: CepPatternAppendActor. In this case, the component actor is in charge of generating the code for the new pattern, with all the properties that the user may have configured to set up quantifiers and the degree of contiguity. One important issue is whether the pattern being defined is the first one in the pattern sequence, because Flink treats the first pattern differently. This is where the currentPatternVariableName field of the FlinkFlowMessage, already mentioned in Section 5.2.2, comes into play. If this field is null, no previous pattern has been defined, hence the current pattern is the first one in the sequence. Otherwise, a previous one has already been defined, so the new one will be appended to it. In summary, an example of the code that could be generated by this mashup component is shown in Listing 5.14.

Listing 5.14: Sample Code Generated by the "CEP New Patt." Component

// Using the skip strategy defined in the "CEP Begin" mashup component
Pattern<TrafficObservation, TrafficObservation> pattern1 = Pattern.<TrafficObservation>begin("start", strat);

// When appending a new pattern with relaxed contiguity
Pattern<TrafficObservation, TrafficObservation> pattern2 = pattern1.followedBy("middle");

// Quantifiers may also be added
Pattern<TrafficObservation, TrafficObservation> pattern3 = pattern2.followedBy("end").optional().timesOrMore(4);

Pattern Conditions

Individual patterns may consist of one or more conditions that a given event should match. The "CEP Add Condition" mashup component is in charge of defining these conditions. In this case, only conditions applying to the SmartSantander use case are considered, to make the approach as simple and user-friendly as possible. Hence, only conditions that compare numeric values are considered. This allows the detection of patterns like "temperature increasing by more than 15 °C within two hours", etc.

Main Executor: Node10CepPatternCondition. The user may specify whether the condition should be combined with other conditions in the pattern with an AND or an OR operation. They may also choose the attribute of the corresponding SmartSantanderObservation that they want to check in the condition, the type of comparison and the value to compare to. Table 5.12 shows all the details.

Table 5.12.: Properties of the "CEP Add Condition" Mashup Component

Property | Type | Details | Initial Value
Condition Type | SELECT | Choose among: AND, OR | AND
Attribute to match | SELECT | Choose among: Traffic Occupation, Traffic Charge, Traffic Intensity, Noise Level, Temperature, Light Intensity, Level of NO2, Level of CO, Level of Ozone | Traffic Charge
Condition Operand | SELECT | Choose among: greater than (>), greater or equal than (>=), less or equal than (<=), less than (<), equals (==), not equals (!=) | greater or equal than (>=)
Condition Value | TEXT | Value to compare the attribute to. Must be a number. | 50

aflux Actor: CepPatternConditionActor. The actor will generate the anonymous class that defines the appropriate SimpleCondition, and invoke the right method depending on the AND/OR choice of the user. Listing 5.15 shows an example.

Listing 5.15: Sample Code Generated by the "CEP Add Condition" Component

Pattern<TrafficObservation, ?> pattern2 = pattern1.where(new SimpleCondition<TrafficObservation>() {
    @Override
    public boolean filter(TrafficObservation trafficObservation) throws Exception {
        if (trafficObservation.getCharge() >= 50)
            return true;
        return false;
    }
});

End Pattern Sequence

Finally, the "CEP End" mashup component is in charge of two main issues:

Apply the defined pattern sequence to a certain data stream.

Select an event among the matching ones of a certain individual pattern and trigger some notification.

Notifications can be any POJO; for instance, in the example shown in Listing A.5, a class named Alert had been created. In this case, the SmartSantanderAlert class has been defined and added to the SmartSantander Connector for Flink. It just has a String field containing a message that informs the user about what went wrong. Which information is included in this message is also configured by the end user in the "CEP End" mashup component.

Main Executor: Node11CepPatternEnd. The user may specify the name of the individual pattern whose matching events are of interest, the position of the event within the match whose details should be reported, and the text of the notification message. Table 5.13 shows all the details.

Table 5.13.: Properties of the "CEP End" Mashup Component

Property | Type | Details | Initial Value
"Pattern name" | TEXT | Name of the individual pattern to select. It was previously defined by the user when creating the pattern. | "mypattern"
"Element no." | TEXT | Number of the event in the matching pattern whose details are to be reported. For instance, a user may be interested in knowing the value of the last event. | 0
"Message" | TEXT | Text to include in the message. The details about the selected event will be added automatically. | -
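Since the class is described as a plain message wrapper, a hedged sketch of SmartSantanderAlert could look as follows; the package name is illustrative, and the shape simply follows Flink's POJO rules (public class, no-argument constructor, accessible fields or getters/setters).

import java.io.Serializable;

public class SmartSantanderAlert implements Serializable {

    private String message;   // what went wrong, e.g. "Charge went too high in ..."

    public SmartSantanderAlert() { }   // no-arg constructor required for Flink POJO types

    public SmartSantanderAlert(String message) {
        this.message = message;
    }

    public String getMessage() {
        return message;
    }

    public void setMessage(String message) {
        this.message = message;
    }

    @Override
    public String toString() {
        return "SmartSantanderAlert{" + message + "}";
    }
}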

aflux Actor: CepPatternSequenceEndActor. The actor will generate the statements to match the pattern against the desired data stream and to produce the data stream of alert POJOs, as shown in Listing 5.16.

Listing 5.16: Sample Code Generated by the "CEP End" Component

// Apply the pattern to the DataStream<TrafficObservation> named trafficDataStream
PatternStream<TrafficObservation> patternStream = CEP.pattern(trafficDataStream, pattern3);

// Select the first event of the individual pattern named "end",
// add the message "Charge went too high in "
// and generate a DataStream of SmartSantanderAlert out of it
DataStream<SmartSantanderAlert> alerts = patternStream.select(new PatternSelectFunction<TrafficObservation, SmartSantanderAlert>() {
    @Override
    public SmartSantanderAlert select(Map<String, List<TrafficObservation>> map) throws Exception {
        TrafficObservation event = map.get("end").get(0);
        return new SmartSantanderAlert("Charge went too high in " + event.toString());
    }
});

Executing and Generating Job

The last step in a Flink program is to trigger the program execution [75]. The "End Job" mashup component is in charge of this, and of much more, because at this point the final code is still to be generated and the Flink job still has to be packaged into a .jar file.

Main Executor: Node99ExecuteAndGenerateJob. This mashup component has no properties and no output, to let the user know that it should be placed at the very end of the flow they create in the GUI.

aflux Actor: ExecuteAndGenerateJobActor. To be precise, not a single line of Java code is generated by any of the other actors described above. They just append statement definitions and parameters to the CodeBlock.Builder instance that they find inside the FlinkFlowMessage they receive. The code itself is generated in this last step. In fact, the procedure is the following:

1. First, the statement to trigger the Flink program execution is added, just like any other actor of this plug-in would do.

2. All Flink programs should start from a Maven project template that the Flink team itself provides [88]. To make this aflux plug-in absolutely self-contained, the template has been integrated as a resource inside it. However, because of the way that Java archives work (remember from Section that aflux plug-ins are

86 5.3. Automatic Mapper for Flink API executed from their corresponding.jar files), this template needs to be extracted before it can be used in any way. This is considerably challenging, especially considering that the structure of paths in the file system changes depending on the Operating System. So, paths need to be defined programmatically from system constants and, after that, an approach like the one described in [89] may be used to extract the required files. 3. Once the project template has been extracted, the final Java code can be generated using JavaPoet, which translates all the statement definitions that were appended to the CodeBlock.Builder and makes use of all the user-defined constraints and properties. This code is added to a new file inside the project template that was just extracted. 4. The output of this Flink plug-in for aflux should be a Java archive (a.jar file) that can then be deployed in any Flink instance. Thus, it is not enough with generating the source code. Instead, this code needs to be compiled and packaged by Maven. Fortunately, although the most common interface to do this is via Command-Line Interface (CLI) Maven supports doing this from Java as well, by means of the Maven Invoker [90]. 5. The output of Maven Invoker is sent to the aflux engine in order to show it to the user, and (as long as there are no compilation errors, more on this in Section 5.4) the.jar containing the Flink job is finally available for the user to run it in a Flink instance. For further reference, the Listing A.4 contains the Java code of this actor. The diagram depicted in Fig. 5.3 is an IDAR Graph which summarizes how the translation into Flink source code takes place. IDAR Graphs are introduced in [91] and offer a way of representing how system components communicate and exchange information in a more readable way as compared to their UML equivalents. In an IDAR Graph, objects can communicate either sending a command message (control messages) or a non-command message, which is called a notice. An arrow with a bubble (circle) on its tail stands for an indirect method call. Other subsystems (which would be the superior item of another IDAR Graph) are represented with hexagons Automatic Mapper for Flink API When generating Flink code, the actors that have been defined in Section 5.2 take essentially two types of information, as it was explained in Chapter 4: The preferences that the user has specified using the mashup component properties. A generic structure of the statement that should be generated. 59

[Figure 5.3 (IDAR Graph): interaction between the aflux engine, the plug-in actors (from EnvironmentSetUpActor to ExecuteAndGenerateJobActor), the Flink API Mapper, the CodeBlock.Builder and the Maven Invoker while generating FlinkJob.java and packaging the job.]

Figure 5.3.: Programming Flink from aflux

To make the latter statement structure as generic as possible, the parts that correspond to the Flink API should not be hardcoded. Instead, they should be treated as constants and kept separately from the statement description. Extracting constants is a well-known practice in programming because it improves maintainability considerably. However, in this case there are further advantages:

If an API changes, only the constant must be modified, without any change required in the actors using that API.

If a new API is to be supported, it is as easy as adding it as a new constant and defining the statement in the corresponding actors.

88 5.3. Automatic Mapper for Flink API As a consequence, it is clear that the first step is to extract all the constants together and keep them somewhere where they can be easily modified, updated and expanded. But keeping a set of constants that is manually configured can also become difficult to maintain if the number of constants gets too high. For this purpose, in this plug-in for aflux these constants are created automatically, by means of a mapper that processes the Flink API and generates them, just ready to be used by the actors. This way, whenever a new Flink API is to be supported, it is already available and no constant needs to be created for it. To automatically create this mapping between constants used by the actors and the corresponding Flink APIs involves some technical challenges. For starters, the approach should be scalable. For instance, if all the Flink APIs need to be analyzed each time any actor wants to invoke an API, then the application will not meet the requirements of responsiveness. To solve this, the Singleton design pattern [92] will be used to make sure that just one instance of the mapper ever exists. The mapper should run just once, and then keep the necessary information in memory for the actors to use them. Only in this way will the automatic mapper be efficient enough. Fig. A.3 depicts the UML diagram of the Mapper. Three main entities (implemented as separate Java classes) compose it: The first entity is a parser. It extracts the Flink source code and processes each of the source files in it to create an AST, which is a tree representation of the syntactic code structure of the source file. The second entity is an analyzer. It goes through an AST and searches for a node of interest. This node can be anything, from a class declaration to a method declaration to a variable definition. The third entity is the mapper itself. It takes the information that the analyzer outputs and keeps just the necessary part, which will be kept in memory during the whole execution. This is the external interface to the mapper: the one that the actors will be using. The rest of the entities remain as internal components. Each of these components will be now explained in more detail. But before that, an overview of the JavaParser library, which has been used as the core engine of the Mapper, will be given The JavaParser Library JavaParser is a Java library that provides a set of tools to process Java source code programmatically [93]. It is rather complex yet extremely powerful, and it offers four main features: parsing, analyzing, transforming and generating Java code [94]. However, just the first two will be used in this case. Note that JavaParser could also have been used to generate Java code in the implemented mashup components (see Section 5.2 for more details), instead of JavaPoet. 61

89 5. Implementation However, generating code with JavaParser is extremely more complex while JavaPoet offered a functionality that is simpler but easier to use and more than enough for what it is used for. JavaParser behaves just like the syntax checker and code generators of Integrated Development Environments (IDEs), like the Eclipse Abstract Syntax Tree [95]. As far as this thesis is concerned, these are the relevant components of JavaParser that will take part in the Mapper: The JavaParser class, which is capable of producing the AST from the code. The CompilationUnit class, which acts as the root of the AST. The Visitors, which are classes that may be used to find specific parts of the AST. These entities will be used in the engine of the Mapper, plus some business logic to make it fit the specific requirements about Flink and aflux Parsing the Flink API The FlinkApiParser class has one single, static method, namely processapizipfile(), which takes the source code of a given Flink distribution as input and produces a list of CompilationUnit, i.e. a list of ASTs that correspond to the analyzed source files. As it had to be done with the Flink project template (see Section 5.2.9), the source code of Flink will also be read from inside the aflux plug-in, to make it completely self-contained. However, no proper extraction needs to be done, as the goal here is to process the files, not to use them somewhere else (in the Flink project template, files had to be then processed by the Maven Invoker, making it necessary to extract the files to the filesystem). In this case, the files inside the.zip will be read but kept in memory, as there is no need to keep them longer. It is extremely useful to be able include the whole source distribution of Flink (which can be easily downloaded as a.zip file e.g. from their GitHub repository 7 ) in the resources folder of the Flink plug-in for aflux and have the FlinkApiParser take care of everything. But it is not necessary to process absolutely all the files inside the source distribution. Indeed, some files correspond to other non-java APIs (e.g. Spark APIs), or to the engine of Flink (i.e. no APIs at all). Even some Java APIs may not be currently supported by the implemented actors. In these cases, it makes sense to skip those files to make the parsing process more efficient, especially considering that a source distribution of Flink can be more than 25 MB big, with more than 8,000 files. To this end, the FlinkApiParser takes another parameter as well, which stands for the set of Maven artifacts 8 that are to be mapped. As of the moment of writing this thesis, the following artifacts are included in the Mapper: 7 [Online] 8 In Maven projects, an artifact stands can be both a project or a "subproject". In this section, it will be used as a synonym for "subproject", for instance the components of the Flink project. 62

90 5.3. Automatic Mapper for Flink API flink-connectors/flink-connector-kafka-0.8 flink-core flink-java flink-libraries/flink-cep flink-streaming-java But supporting a new one is as easy as adding it to the array that is passed as a parameter to the FlinkApiParser. The output of the FlinkApiParser is a List<CompilationUnit> which contains all the ASTs for the Java files in the specified artifacts Analyzing the Flink API Once the source files have been parsed and the AST is available, it can be analyzed by using a Visitor. A visitor allows focusing on a certain type of nodes in the AST, avoiding large loops over nodes that are not of interest. In JavaParser, visitors that do not make changes in the AST (just traverse it) are known as VoidVisitor. They can take a type parameter that is also passed to each of their methods, like some aggregator that is reused when going through the AST [93]. A key-value-like data structure can be used to keep known constants as keys (they will be used by the actors to include Flink APIs) and the real Flink APIs as values. Therefore, a Map will be used as aggregator type in all the visitors of the Mapper. The nodes of interest in the Mapper are the following ones: Declarations of classes, interfaces and enumerations, to retrieve its name and package, which in turn allows invoking the ClassName.get() method of JavaPoet to use it when generating code. Declarations of methods, which can be passed as String to JavaPoet to use them when generating code. The set of classes that inherit from a given one (either extending a parent class or implementing an interface), which can be used e.g. to show the user the available types of SmartSantanderObservation. Hence, three different visitors are required, one for each of the previous items: a ClassAndEnumVisitor, a MethodAndEnumConstantVisitor and a PolymorphismVisitor. Some of them have code in common (e.g. searching for a certain type of node upstream in the AST). For this reason, they are all included as static classes in a class named FlinkApiVisitors. 63
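Before each visitor is described in detail, the following hedged sketch illustrates the overall parse-then-visit flow. The class and method names are illustrative rather than the plug-in's exact code, enum handling is omitted for brevity, and newer JavaParser releases expose the static parse methods through StaticJavaParser instead of JavaParser.

import com.github.javaparser.JavaParser;
import com.github.javaparser.ast.CompilationUnit;
import com.github.javaparser.ast.body.ClassOrInterfaceDeclaration;
import com.github.javaparser.ast.visitor.VoidVisitorAdapter;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class ApiScan {

    // Collects simple class/interface names together with their package name.
    static class ClassCollector extends VoidVisitorAdapter<Map<String, String>> {
        private final String packageName;

        ClassCollector(String packageName) {
            this.packageName = packageName;
        }

        @Override
        public void visit(ClassOrInterfaceDeclaration decl, Map<String, String> out) {
            super.visit(decl, out);                        // keep visiting nested classes
            out.put(decl.getNameAsString(), packageName);  // e.g. "DataStream" -> its package
        }
    }

    // Parses every .java entry of the Flink source .zip and returns simple name -> package.
    public static Map<String, String> scan(InputStream flinkSourcesZip) throws Exception {
        Map<String, String> classes = new HashMap<>();
        try (ZipInputStream zip = new ZipInputStream(flinkSourcesZip)) {
            ZipEntry entry;
            byte[] buf = new byte[4096];
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.getName().endsWith(".java")) {
                    continue;   // skip Scala sources, resources, documentation, ...
                }
                // read the current entry into memory; no extraction to the filesystem is needed
                ByteArrayOutputStream bos = new ByteArrayOutputStream();
                int n;
                while ((n = zip.read(buf)) != -1) {
                    bos.write(buf, 0, n);
                }
                CompilationUnit cu = JavaParser.parse(bos.toString("UTF-8"));
                String pkg = cu.getPackageDeclaration()
                               .map(p -> p.getNameAsString())
                               .orElse("");
                cu.accept(new ClassCollector(pkg), classes);
            }
        }
        return classes;
    }
}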

91 5. Implementation The ClassAndEnumVisitor Visitor #1 focuses on ClassOrInterfaceDeclaration and EnumDeclaration nodes in the AST, to get its name and search the AST upstream to find its PackageDeclaration. A class may be defined inside another class, so the package declaration could be close or not so close in the AST. This visitor handles all of that. The ClassAndEnumVisitor is triggered in every declaration of Class, Interface or Enum. It uses a Map<String, String> as aggregator type, which keeps the class name as key and the full package name as value. The MethodAndEnumConstantVisitor Visitor #2 focuses on MethodDeclaration and EnumConstantDeclaration nodes in the AST, to store their names. The MethodAndEnumConstantVisitor is triggered whenever a new method is defined or when a new constant is defined inside an enumeration. In other words, it focuses in a class children but skipping fields and declarations of other classes. It uses a Map<String, String> as aggregator type, which keeps the class name and child name as key and the child name as value. In the future, this could be improved to include any word in the child name as key, making an even easier access possible. The PolymorphismVisitor Visitor #3 focuses on finding the names of the classes that inherit from a given one. For this reason, it has a String field containing the name of the parent class to be searched for. The PolymorphismVisitor is triggered in every ClassOrInterfaceDeclaration. It collects the results in a List<String> which contains the names of the resulting classes The Flink API Mapper Finally, the FlinkApiMapper serves as the entry point to the Mapper. As stated above, it implements the Singleton design pattern, to make sure that there is only one instance of it, which is created when the plug-in is loaded (no lazy instantiation). This component has the FlinkApiParser parse the source code from both Flink and the SmartSantander connector. Then it instantiates the visitors in FlinkApiVisitors and retrieves its aggregators. As a special remark, note that the ClassName objects are not created directly in the ClassAndEnumVisitor. Instead, to save memory, they are lazily loaded whenever they are required, and kept in another Map that serves as a cache during execution, to avoid creating them again. Three methods are exposed to be used from the outside, each of them corresponding one of the three visitors. Here are their signatures: 64

public ClassName getClassNameInstance(String simpleName)
public String getMethodName(String simpleName)
public List<ClassName> getChildClasses(String parentClassSimpleName)

5.4. End-User Continuous Support

In Sections 5.1 to 5.3, a set of mashup components has been presented. They enable end users to create Flink programs from the GUI of aflux, yet they are of little use if the end user does not know how to combine them. There is a boundless number of possible combinations, but not all of them result in Flink programs that compile without errors. For this reason, as discussed in Chapter 1, it is necessary to include a mechanism that continuously supports the end users of aflux while they are creating mashups.

For this purpose, a simple yet powerful semantics language has been defined. This language is based on conditions that apply to any two mashup components in aflux, enforcing the order in which they should appear. The approach is the one that was defined in Chapter 4:

Structure of a Semantics Condition in aflux:
MC_A  should | must  come (immediately)  before | after  MC_B
where MC_A is the main component, MC_B is the argument component, the should/must choice corresponds to the isMandatory flag, the optional "immediately" to the isConsecutive flag, and the before/after choice to the isPrecedent flag.

As can be seen, the visual components denoted in Chapter 4 become mashup components (MC) in the context of aflux.

The ToolSemanticsCondition

A POJO can be created to define conditions. The implemented approach should enable continuous validation of semantics among mashup components in a flexible and scalable way, so it makes sense to make this feature available for all aflux mashup components, not just for those in the Flink plug-in. As a consequence, the class ToolSemanticsCondition has been defined in the core of aflux, namely in the de.tum.in.aflux.tools.core package. Basically, it is treated just like the ToolProperty class that is used to define the component properties when extending AbstractMainExecutor. Although many-to-many relationships are usually implemented with a separate entity that contains the mapping information [96], in this case it is a design choice to define the conditions inside one of the mashup components (the main one, or MC_A). This makes the approach much more intuitive: when a new mashup component is created, an array of

93 5. Implementation ToolSemanticsCondition can be specified, and they will be automatically enforced by the core of aflux. The ToolSemanticsCondition class maps the structure that was presented above, with the following considerations: The flags are implemented as boolean fields. The MC A do not need to be specified, because it is implicit when defining the condition inside a mashup component. The MC B is a reference to the class where that component has been defined. Hence, making use of Java Reflection, it has been implemented as an instance of a Class object, more specifically a Class with a bounded type parameter (see [97] for more details) Class<? extends AbstractMainExecutor>, to make sure that only aflux mashup components can be used when defining a semantics condition. Of course, an array of ToolSemanticsCondition must be added to the AbstractMainBase Tool class, so that semantics conditions can be defined in whichever aflux mashup component. Fig. 5.4 summarizes this relationship. Note that a new constructor has been defined, to support backwards compatibility. This means that the array of ToolSemanticsCondition may be defined using the new constructor or it may not. In the latter case, the old constructor will be used, and everything will remain working with no validation. With this approach, semantics validation among mashup components can be added to any component in aflux (not just those in the Flink plug-in), and if they are not added, everything will keep working Back-End Job Validation Not only should conditions be defined, but also validated. To enable this, a new endpoint in the aflux core has been defined, under the /jobs/validate URL. As it was indicated in Section 2.4, the web application of aflux is powered by the Spring framework. The business logic for all the endpoints under the /jobs URL is located in the FluxJobController Spring controller, so this is where the new endpoint (mapped to a validate() function) will be defined. When a POST request is sent to /jobs/validate, the job 9 in the context of the request will be processed by FluxJobController inside validate(), where it will be passed on to a new entity: a FluxJobValidator. This is how the FluxJobValidator behaves: Several remarks about Algorithm 1 need to be made: 9 Note that the word "job" here no longer refers to a Flink job, but to a FlowJob, which is used in the context of the aflux core to refer to the set of activities (each of them containing one or more mashups) that the end user defines in the GUI. 66

94 5.4. End-User Continuous Support Figure 5.4.: ToolSemanticsCondition in the aflux Tool Core Whenever a condition is not met, some information about it is added to an accumulator named result. This accumulator contains the details about all the conditions that failed for a given FlowElement, which are finally added to it to inform the user later. The elements in an activity are not ordered according to the order of the flow. Instead, they appear in the order in which they were created in the GUI. Therefore, they need to be ordered before performing the validation. In this step, just the elements that are connected (and hence part of the flow) are considered; the rest of them are ignored. 67

Algorithm 1: Job Validation in FluxJobValidator.java

foreach activity in the job do
    order the list of elements as they appear in the activity;
    foreach element in the orderedList do
        instantiate the AbstractMainExecutor that corresponds to element;
        get the set of conditions out of it;
        instantiate a new result;
        foreach condition in conditions do
            foreach element in the orderedList do
                if condition is not met then
                    result.add(condition);
                end
            end
        end
        if result is empty then
            clear error information from element;
        else
            add error information to element;
        end
    end
end

In this approach, it has been assumed that there is only one flow per activity, or at least that only one flow is created at a time. In other words, it is preferable that different flows are placed in different activities, but they can be included within the same activity as long as no new flow is created before the previous one is finished. Under these circumstances, the validation engine works without further issues.

When adding the error information to a FlowElement, three different types of feedback are included. The first two allow the user to notice the errors regardless of whether they have clicked on the faulty element; should the user want to know more details, that is, which conditions failed exactly, the third type of feedback comes into play:

The color of the element is changed from blue to red.

The name of the element gets an asterisk ("(*)") appended to it.

A message containing details about each of the conditions that failed to be met is added to a new String field in FlowElement, named errors.

In the next section, the way in which this error information is used by the front-end to inform the user about validation errors is presented.
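Before moving on to the front-end, a hedged sketch of the /jobs/validate endpoint described in Section 5.4.2 is shown below. The real FluxJobController hosts all /jobs endpoints and the exact signatures of FlowJob and FluxJobValidator differ from this simplification, so only the Spring annotations and the overall request/response shape should be taken literally.

import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
@RequestMapping("/jobs")
public class FluxJobController {

    // Validates the job sent by the front-end and returns it enriched
    // with per-element error information (colour, name suffix, error text).
    @PostMapping("/validate")
    public FlowJob validate(@RequestBody FlowJob job) {
        new FluxJobValidator().validate(job);   // hypothetical helper entry point
        return job;
    }
}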

96 5.4. End-User Continuous Support Front-End Feedback As explained in Section 2.4, the front-end of aflux is based on a React + Redux application, so React s one-way data binding applies. Each user action or event is associated with an action that is then processed and triggers some change in visualization, which is re-rendered by React. It is a design decision to trigger validation of the flow whenever two flow elements are connected (see Section 4.2 for more details). Consequently, this user event should trigger a new action, named "validatejob" that sends the whole job to the back-end (specifically to /jobs/validate, the endpoint defined in Section 5.4.2). When a response is received, which happens asynchronously because of the way JavaScript works, the whole job is replaced by the received one, and hence the errors are shown. Thereby, the main changes in aflux s front-end to support semantics validation among nodes are: A new div section in the ReduxNodeProperties component, to map the errors attribute of a FlowElement. It is rendered exactly the same way as the div with the hints of a mashup component, but in red color. The details of the conditions that failed will be shown here, in case there is any. A new action to trigger the job validation. More specifically, it is not a brand-new action, but a new "subtype" of generalaction, because in aflux s front-end all user actions are mapped to the same single action. The action behaves as follows: 1. First, an internal action is triggered to refresh the activities in the state and make them meet the ones that are actually in the canvas. This is required because the front-end application allows to keep newly-created flows separately in the state until the user clicks on the "save" button, yet the structure of this temporary flow is different from that of a saved one and lacks some of the information that is required for the job to be validated. 2. Then, another internal action is triggered to refresh the activities inside the job in the state and make them meet the ones in the previous step. For some reason, this is not done automatically and needs to be ordered explicitly. 3. After this, the "validatejob" action itself is triggered for dispatch, and an asynchronous request to the back-end is made. This request behaves similarly as the one sent to save a job: it includes the current job in the body of the request. 4. The callback of the request is in charge of replacing the activities in the state by those received in the response. This way, when each of the flow elements is rendered, the error information will be shown to the end-user. Fig. 5.5 shows how errors are fed back to the end user. As data cannot be filtered by location unless some data source has been defined before (in other words, "GPS Filter should come immediately after SmartSntndr Data"), the validation throws an error, and 69

the flow element gets red, an "(*)" appended to its name, and details about the errors in the right-side bar.

Figure 5.5.: Validation Errors Rendered in aflux's Front-End

Using Conditions in Mashup Components

Now that a full engine for enforcing semantics validation between flow elements is in place, the only remaining step is to define the conditions that govern the Flink plug-in. These semantics are directly derived from the constraints that the Flink API enforces. Table 5.14 contains all the conditions that have been defined for the mashup components in the Flink plug-in for aflux. These conditions ensure that compilable programs will be created. The user gets continuous support on what they are doing wrong, so that they can fix whatever is necessary to successfully create a Flink job.

The diagram depicted in Fig. 5.6 is an IDAR Graph 10 which summarizes how the front-end and back-end interact to enable continuous end-user support via semantics validation.

10 Please refer to the end of Section 5.2 and to [91] for more details about how to interpret it.

Table 5.14.: Semantics Conditions in the Mashup Components of the Flink Plug-in

MC_A | should/must come (immediately) before/after | MC_B
Begin Job | must come before | End Job
SmartSntndr Data | must come immediately after | Begin Job
SmartSntndr Data | should come immediately before | GPS Filter
GPS Filter | must come immediately after | SmartSntndr Data
Select | must come after | SmartSntndr Data
Window | must come immediately after | Select
Window | must come immediately before | Window Op.
Window Op. | must come immediately after | Window
Output Result | must come after | SmartSntndr Data
CEP Begin | should come after | GPS Filter
CEP Begin | should come after | SmartSntndr Data
CEP Begin | must come immediately before | CEP New Patt.
CEP Begin | must come before | CEP End
CEP New Patt. | must come after | CEP Begin
CEP New Patt. | must come immediately before | CEP Add Cond.
CEP Add Cond. | must come after | CEP Begin
CEP Add Cond. | must come after | CEP New Patt.
CEP End | must come after | CEP Begin
End Job | must come after | Begin Job

[Figure 5.6 (IDAR Graph): interaction between the front-end (ReduxWorkspaceContainer, Redux Store, AsyncFlowsAppReducer, React components) and the back-end (FlowJobController, FlowJobValidator, AbstractMainExecutor, FlowElement) during job validation.]

Figure 5.6.: End-User Continuous Support

In this chapter, the implemented approach is evaluated to assess how easy it is to create Flink jobs graphically from aflux. First, the evaluation scenario is presented. After that, the obtained results are described and discussed.

The Evaluation Scenario

Throughout this document it has been argued that the city of Santander can be considered a particular case in which the outcomes of this thesis may be applied. Since it is a Smart City, taking it as the evaluation context makes it possible to extrapolate the results to other use cases and scenarios in which IoT applications play an important role. If the city hall in Santander decided to deploy aflux as one of the tools available to their analysts and decision makers, the functionality that the implemented approach would bring to the city is summarized in the use case diagram depicted in Fig. 6.1. The following items can be identified:

[Figure 6.1 (use case diagram): a City Hall Analyst uses aflux for "Analyze Data in Real Time" and "Detect a Pattern in Real Time"; both use cases rely on the «service» SmartSantander Live Data.]

Figure 6.1.: The Evaluation Scenario

The main actor 1 is an analyst of the City Hall of Santander. This person may not have programming skills, and even if they did, they might not have any

1 This actor has nothing to do with the actors of an Actor System described in previous sections of this document (Akka actors). In this case, it refers to the UML terminology, in which actors are simply people or external systems that interact with the main system (aflux) [98].

101 6. Evaluation knowledge about Big Data platforms, Stream Analytics and Flink. They just need to know how to use aflux from the end-user perspective (drag and drop mashup components, etc.) and have some very basic information on what Flink can do, from a functionality point of view rather than from a developer point of view. For example, the City Hall analyst should know that changes in the city are measured in events and events can be processed in groups called windows. They do not need to know any details about how to create a window in the Flink Java API, or the fact that Java generics need to be used when defining the type of window. Two use cases can be identified, i.e. there are two types of operations that the analyst can perform using the Flink plug-in for aflux: Analyze data in real time. This involves combining data from different sources of the city and processing it somehow. The goal of this use case is to gain insights about the city, that help decision makers in the City Hall take the appropriate calls. For instance, if the air quality is below the desired levels, the City Hall would probably wonder whether it is a good decision to restrict traffic in the city. To prepare a report to their supervisor, an analyst could create a program to emit the air quality and traffic charge of a certain area and see if they are related (hence it is a good decision to restrict traffic) or not (hence restricting traffic will have no impact on the air quality). Detect a pattern of events. In this case, the goal is not to gain insights on the data, but to have the system notify the City Hall whenever the pattern is detected. The analyst would only have to define the pattern to search and the type of notification to be sent in case of a match and the system would take care of the rest. For instance, the analyst could define a pattern to detect a progressive increment in the traffic charge in a certain area of the city. If this happens, they would get a notification to e.g. send a police car to regulate traffic. The main system, i.e. the system under evaluation is aflux. The SmartSantander API external service is required to retrieve the live data in both use cases. Note that it would be up to the staff in the City Hall to decide which actions to take in light of the program results. Therefore, it is the ease and simplicity of creation of Flink programs that will be evaluated in this chapter, and not the way in which the decisions are taken after running them, which exceeds the scope of Data Analytics. 74

6.2. Overall Considerations

The following considerations apply to all the experiments that have been conducted:

The "Output Result" component has been set up to write to CSV files, so that the data can be plotted to make the results more illustrative.

All jobs have been run in a local Flink cluster [99].

For the sake of readability, only the most important configuration parameters of the mashup components are highlighted. The remaining parameters that are implicit in the description of an experiment may be configured as described in Chapter 5; otherwise, default values should be used.

Evaluation Execution Platform

The experiments presented in this chapter have been tested with a local deployment of aflux and a local instance of Flink. The execution platform was a Windows 10 laptop with an Intel i7-7500U processor and 8 GB of RAM.

Use Case 1: Real Time Data Analysis

To evaluate this use case, four experiments have been conducted.

Use Case 1 Details
Use Case Name: Real Time Data Analysis
Main Flink APIs Under Evaluation: AggregateFunction, AllWindowedStream, DataSink, DataSource, DataStream, EventTime, FilterFunction, MapFunction, RichSourceFunction, SlidingEventTimeWindows, StreamExecutionEnvironment, TumblingEventTimeWindows

Experiment 1

In this example, temperature vs. air quality in a certain area is to be compared with the average of the city:

103 6. Evaluation Use Case 1 - Experiment 1 Overview: temperature vs. air quality / certain area vs. city average. Questions to Answer: How are temperature and air pollution related? Do the conditions of the area in comparison with the rest of the city affect this relation? To study the relation between the level of a certain gas and temperature, the analyst needs to create four flows, (or wire them all together to create a simple Flink job): two of them will be analyzing temperature data (i.e. the temperature attribute in the environment dataset) and two of them will be analyzing air quality (e.g. the levelofco attribute in the airquality dataset). Two flows are required for each dataset because one will include a "GPS filter" component (like in Fig. 6.2a), and the other one will not include it to process all the data in the city (see Fig. 6.2b). To avoid adding the same mashup components again and again, the analyst could make use of the subflow feature of aflux. (a) Flow A (b) Flow B Figure 6.2.: Flows in aflux for Use Case 1 Adding More Data Sources to the Analysis Fig. 6.4 shows how the analyst can get input from the real-time data that they configured. Adding a third source of data, to see not only the level of NO 2 but also the level of ozone is as simple as changing a property in the "SmartSntndr Data" component. However, if they were doing it manually, the Java code for a new MapFunction would have to be created, as shown in Listing 6.1. Listing 6.1: Required Code to Select Two Different Gas Levels 1 DataStream<Double> levelofno2 = filteredairquality.map(new MapFunction< AirQualityObservation, Double>() { 3 public Double map(airqualityobservation airqualityobservation) throws Exception { 76

104 6.3. Use Case 1: Real Time Data Analysis 4 return Double.valueOf(airQualityObservation.getLevelOfNO2()); 5 } 6 }); 7 DataStream<Double> levelofozone = filteredairquality.map(new MapFunction< AirQualityObservation, Double>() { 9 public Double map(airqualityobservation airqualityobservation) throws Exception { 10 return Double.valueOf(airQualityObservation.getLevelOfOzone()); 11 } 12 }); Changing the Type of Window Tumbling windows were used in Fig. 6.4, but processing the data in a different type of window (e.g. using sliding windows) is as easy as changing the properties of the "Window" mashup component (see Fig. 6.3). In Java, the user would need to know that a sliding window takes an extra parameter, and that the window slide needs to be specified using Flink s Time class, in which a different method is invoked depending on the time units that they desire (see Listing 6.2). (a) Tumbling Windows (b) Sliding Windows Figure 6.3.: Tumbling vs. Sliding Windows in aflux 77

105 6. Evaluation Listing 6.2: Tumbling vs. Sliding Windows in Java 1 AllWindowedStream<Double, TimeWindow> tumblingwindowedtemperature = temperature. windowall(tumblingeventtimewindows.of(time.minutes(5))); 2 AllWindowedStream<Double, TimeWindow> tumblingwindowedlevelofno2 = levelofno2. windowall(tumblingeventtimewindows.of(time.minutes(5))); 3 4 AllWindowedStream<Double, TimeWindow> slidingwindowedtemperature = temperature. windowall(slidingeventtimewindows.of(time.minutes(5), Time.minutes(1))); 5 AllWindowedStream<Double, TimeWindow> slidingwindowedlevelofno2 = levelofno2. windowall(slidingeventtimewindows.of(time.minutes(5), Time.minutes(1))); Fig. 6.5 depicts the results when a sliding window is selected Experiment 2 In this experiment, traffic charge is to be compared with the air quality in a specific area in the city: Use Case 1 - Experiment 2 Overview: air vs. traffic charge in a specific area. Questions to Answer: How are air pollution and traffic in a certain area related? Does it make sense to limit traffic so as to reduce pollution? Again, to compare the results with tumbling and sliding windows is straightforward. Figs. 6.7 to 6.8 show the results. Two flows are required, following the structure of Fig. 6.2a. Changing the Area Under Analysis If the analyst were to change the location of the area that is being analyzed, doing it from aflux is as easy as drag-and-dropping in the properties panel of the "GPS Filter" mashup component and choosing the desired radius. However, if this had to be done manually, the analyst should have to check the specific latitude and longitude of the area and change it in the FilterFunction. Listing 6.3: Changing Filtering Location in Java 1 DataStream<TrafficObservation> filteredtraffic = trafficobservationdatastream.filter( new FilterFunction<TrafficObservation>() { 3 public boolean filter(trafficobservation input) throws Exception { 4 final double EARTH_RADIUS = 6371; 5 double reflat = ; // Enter desired latitude here 6 double reflng = ; // Enter desired longitude here 7 double radius = 1; // Enter desired radius here 8 double currentlat = input.getlatitude(); 9 double currentlng = input.getlongitude(); 78

[Figure 6.4: three panels over time comparing the Specified Area against the City Average for Temperature (°C), NO2 (µg/m³) and Ozone (µg/m³).]

Figure 6.4.: Use Case 1, Experiment 1. Specified Area vs. City Average. Live data from SmartSantander, 9th July. Tumbling Windows: size=5min.

[Figure 6.5: three panels over time comparing the Specified Area against the City Average for Temperature (°C), NO2 (µg/m³) and Ozone (µg/m³).]

Figure 6.5.: Use Case 1, Experiment 1. Specified Area vs. City Average. Live data from SmartSantander, 9th July. Sliding Windows: size=5min, slide=1min.

108 6.3. Use Case 1: Real Time Data Analysis (a) Previous Location (b) New Location Figure 6.6.: Changing Filtering Location from aflux 10 double dlat = Math.toRadians(refLat - currentlat); 11 double dlng = Math.toRadians(refLng - currentlng); 12 double sindlat = Math.sin(dLat / 2); 13 double sindlng = Math.sin(dLng / 2); 14 double va1 = Math.pow(sindLat, 2) + Math.pow(sindLng, 2)* Math.cos(Math. toradians(currentlat)) * Math.cos(Math.toRadians(refLat)); 15 double va2 = 2 * Math.atan2(Math.sqrt(va1), Math.sqrt(1 - va1)); 16 double distance = EARTH_RADIUS * va2; 17 System.out.println("TA Distance from " + currentlat + " // " + currentlng + " : " + distance); 18 return distance < radius;} 19 }); As it can be seen, modifying the desired location programmatically can become cumbersome, while it can be done with no effort if using the GUI Experiment 3 In this experiment, traffic charge is to be compared with the noise level in a specific area in the city: 81

109 6. Evaluation Use Case 1 - Experiment 3 Overview: noise vs. traffic charge in a specific area. Questions to Answer: How are noise and traffic in an area related to each other? Does it make sense to limit traffic so as to reduce noise? This experiment is equivalent to the Experiment 2 in Section 6.3.2, but in this case the noise level is under analysis. The results are depicted in Figs. 6.9 to Experiment 4 In this experiment, the objective is to create a real-time monitoring system of different parameters of the city, e.g. to help the City Hall keep track of them. Maximum and minimum values will be monitored and output. Use Case 1 - Experiment 4 Overview: max/min values real-time monitor Questions to Answer: How can maximum and minimum temperature and air quality be continuously monitored? In this case, a flow like the one depicted in Fig. 6.2a (or Fig. 6.2b if the whole city is to be monitored) is required for each of the magnitudes to monitor. With aflux, this is extremely easy because flows can be reused by means of subflows. Changing Window Operations In this experiment, the focus is on proving the easiness of applying different times of operations to windows. If it had to be done manually, then the analyst should have to know that some operations are built-in into the core of Flink, while others must be coded separately and then passed as a parameter to another built-in function (see Listing 6.4 for more details). However, in the GUI of aflux it is as easy as changing the properties of the "Window Operation" mashup component, as shown in Fig Listing 6.4: Different Window Operations in Java 1 DataStream<Double> maxtemp = windowedtemperature.max(0); 2 DataStream<Double> mintemp = windowedtemperature.min(0); 3 DataStream<Double> avgtemp = windowedtemperature.aggregate(new AverageAggregate()); The results of this experiment are depicted in Figs to

[Figure 6.7: three panels over time showing Traffic (%), NO2 (µg/m³) and Ozone (µg/m³) in the selected area.]

Figure 6.7.: Use Case 1, Experiment 2. Vehicles Pollution in Selected Area. Live data from SmartSantander, 9th July. Tumbling Windows: size=1min.

Figure 6.8.: Use Case 1, Experiment 2. Vehicles Pollution in Selected Area (Traffic, NO2 and Ozone). Live data from SmartSantander, 9th July. Sliding Windows: size=1min, slide=30s.

Figure 6.9.: Use Case 1, Experiment 3. Acoustic Pollution in Selected Area (Traffic and Noise Level). Live data from SmartSantander, 9th July. Tumbling Windows: size=1min.

Figure 6.10.: Use Case 1, Experiment 3. Acoustic Pollution in Selected Area (Traffic and Noise Level). Live data from SmartSantander, 9th July. Sliding Windows: size=1min, slide=30s.

Figure 6.11.: Different Window Operations in aflux. (a) Built-in Window Operations, (b) Custom Window Operations.

Use Case 1 Summary
In these four experiments, the end user was able to create a program that retrieves data from an external API in real time, filters it by location, processes it in different types of windows and aggregates it with different operations. Changing the type of window, the filtering location or the window operation, as well as adding more sources to the analysis, does not require a single line of source code to be written.

6.4. Use Case 2: Pattern Detection

To evaluate this use case, two experiments have been conducted.

Use Case 2 Details
Use Case Name: Pattern Detection
Main Flink APIs Under Evaluation: AfterMatchSkipStrategy, DataSink, DataSource, DataStream, EventTime, FilterFunction, PatternSelectFunction, PatternStream, Pattern, RichSourceFunction, SimpleCondition, StreamExecutionEnvironment

Fig. 6.14 depicts the flow that has been employed in this use case in its simplest form, for the purpose of readability. Whenever more conditions/patterns have been necessary, more "CEP new patt." and "CEP add condition" components have been added.

Figure 6.12.: Use Case 1, Experiment 4. Real-Time Monitor (Max/Min of CO, Ozone, NO2 and Temperature). Live data from SmartSantander, 9th July. Tumbling Windows: size=5min.

Figure 6.13.: Use Case 1, Experiment 4. Real-Time Monitor (Max/Min of CO, Ozone, NO2 and Temperature). Live data from SmartSantander, 9th July. Sliding Windows: size=5min, slide=1min.

Figure 6.14.: Flow in aflux for Use Case 2 (simplified)

Experiment 1

In this experiment, the analyst wants to detect a sudden increase in the traffic charge of a certain area. To this purpose, a CEP pattern sequence could be created to find traffic events in which the charge goes over, for instance, 50%. The analyst wants to target an increase, so they might want to configure a second pattern to find events with a traffic charge over 60% within 10 minutes from the previous one. A third pattern could be defined to search for a traffic charge amounting to 75% within another 10 minutes.

Use Case 2 - Experiment 1
Overview: detect sudden traffic increase in a certain area in the city
Example Uses: instantly send a police car to a congested area, re-configure traffic lights to decrease traffic charge, create a diversion, etc.

If this had to be done in Java, just defining the pattern sequence would require source code like the one shown in Listing 6.5.

Listing 6.5: Required Code to Create a Pattern that Detects Traffic Jams

AfterMatchSkipStrategy strat = AfterMatchSkipStrategy.noSkip();
Pattern<TrafficObservation, TrafficObservation> myPattern =
    Pattern.<TrafficObservation>begin("start", strat)
        .where(new SimpleCondition<TrafficObservation>() {
            @Override
            public boolean filter(TrafficObservation trafficObservation) throws Exception {
                if (trafficObservation.getCharge() >= 50)
                    return true;
                return false;
            }
        }).followedBy("middle")
        .where(new SimpleCondition<TrafficObservation>() {
            @Override
            public boolean filter(TrafficObservation trafficObservation) throws Exception {
                if (trafficObservation.getCharge() >= 60)
                    return true;
                return false;
            }
        }).within(Time.minutes(10))
        .followedBy("end")
        .where(new SimpleCondition<TrafficObservation>() {
            @Override
            public boolean filter(TrafficObservation trafficObservation) throws Exception {
                if (trafficObservation.getCharge() >= 75)
                    return true;
                return false;
            }
        }).within(Time.minutes(10));

PatternStream<TrafficObservation> patternStream = CEP.pattern(filteredTraffic, myPattern);

DataStream<SmartSantanderAlert> alerts = patternStream.select(
    new PatternSelectFunction<TrafficObservation, SmartSantanderAlert>() {
        @Override
        public SmartSantanderAlert select(Map<String, List<TrafficObservation>> map) throws Exception {
            TrafficObservation event = map.get("end").get(0);
            return new SmartSantanderAlert("Charge went too high in " + event.toString());
        }
    });

However, in aflux this can be easily done by configuring the right properties in the "CEP" mashup components, like the ones shown in Fig. 6.15.

Figure 6.15.: Sample Configuration of Components in aflux for Pattern Detection. (a) "CEP Begin", (b) "CEP Patt. New", (c) "CEP Add Condition", (d) "CEP End".

The results of this experiment are depicted in Fig. 6.16.

Experiment 2

In this experiment, extreme temperatures are to be detected:

Use Case 2 - Experiment 2
Overview: detect extreme temperatures in the city (i.e. both lower and upper bound)
Example Uses: notify the authorities, turn on the city AC facilities, warn the population in case of a heatwave, etc.

This experiment is equivalent to Experiment 1 in Section 6.4.1, but in this case the temperature is under analysis. The results are depicted in Fig. 6.17.

Use Case 2 Summary
In these two experiments, the end user was able to create a program that detects, in real time, a specific event pattern in the whole data stream, using CEP. This pattern can be as complex as desired: it can include repetitions of both conditions and patterns, make them optional, etc. The end user may make all these configurations, add different conditions, patterns, and so on without a single line of source code: just by configuring the right properties in the GUI of aflux.

6.5. End-User Continuous Support Evaluation

In this last section, the end-user continuous support that has been detailed in Section 5.4 is evaluated. To do that, the flow depicted in Fig. 6.2b has been created step by step, to see how the continuous support behaves. The process is shown in Fig. 6.18:

1. When adding the "Begin Job" component (Fig. 6.18a), no error is shown. The system is waiting for the user to connect a new component.

2. When the "SmartSntndr Data" component is connected (Fig. 6.18b), the system realizes that one of the conditions for "Begin Job" fails: it must come before an "End Job" component. The user gets three types of feedback: the component with errors turns red, its name gets an asterisk ("*") appended to it and, when clicked, the properties panel on the right shows the details about the errors, i.e. it tells the user how to fix them. This error will remain until an "End Job" component is added.

3. Adding a "Select" component (Fig. 6.18c) triggers no errors, as all its conditions are met.

Figure 6.16.: Use Case 2, Experiment 1. Traffic Events and Alerts on High Charge in Sensor #1018. Live data from SmartSantander, 9th July.

Figure 6.17.: Use Case 2, Experiment 2. Temperature Events and Alerts on Risky Values in Sensor #616. Live data from SmartSantander, 9th July.

Figure 6.18.: Step-by-Step Flow Composition in aflux. Panels (a) to (g) show Steps 1 to 7.

4. However, when a "Window" component is added (Fig. 6.18d), a new error appears: this component must come immediately before a "Window Operation" component. Once again, the user gets three types of feedback; Fig. 6.19 shows what the error details look like. By reading this information, the end user learns about the error, and fixing it is as intuitive as the message states: not until the required component is added (Fig. 6.18e) will the error disappear.

5. Just like in Step 3, adding an "Output Result" component (Fig. 6.18f) triggers no errors.

6. Finally, when the "End Job" component is added, the error in "Begin Job" disappears (Fig. 6.18g), just as the error information was stating.

Figure 6.19.: Details About the Errors in the "Window" Component

Continuous Support Evaluation Summary
As has been shown, the user gets constant feedback on whether or not the graphical flow they are creating in the GUI of aflux has errors. By using three complementary types of feedback, the user learns about these errors instantly and at a glance, and more details on how to fix them are just one click away. Clearly, this kind of user-friendly support is not available when writing source code.

6.6. Critical Discussion

In light of the results of the evaluation, it can be stated that designing pipelines for Stream Analytics graphically can be done in a very easy, user-friendly way by means of the implemented approach. An analyst can easily monitor all the available magnitudes and even learn about failures in some nodes (e.g. a maximum being too high or a minimum being too low).

For starters, the GUI of aflux makes it extremely intuitive to create programs, as dragging and dropping visual mashup components and wiring them together is enough to develop an application. Mashups are indeed simple enough for end users to understand how they work, yet they enable the creation of complex programs.

When programming Stream Analytics for Flink from aflux, a wide range of functionalities and configuration options is available to the end user. In fact, in total six real-world, complete examples targeting the city of Santander have been presented, and they could easily be extrapolated to any other smart city. It is clear that Stream Analytics in general, and the mapped Flink functionalities and configuration options in particular, fit applications for Smart Cities like a glove. This can be extrapolated even further, because in most IoT scenarios (like Smart Agriculture, eHealth, Industry 4.0, etc.) events occur in a flow-like structure that encourages the use of Stream Analytics. And the same happens with other non-IoT scenarios, like the stock market. As a consequence, the implemented approach is not only valid for the city of Santander, but also for any other smart city as well as for most IoT scenarios and even some non-IoT scenarios. For instance, CEP could be extremely useful in eHealth, to notify a doctor whenever a certain event pattern takes place.

Applicability
The implemented approach has proved to be useful for Smart Cities, but it could easily be extrapolated to virtually any other IoT scenario, because in most IoT applications events occur as a stream.

By using aflux and the Flink plug-in for aflux, any end user can easily create runnable jobs. End users do not need to have any knowledge about coding, but even developers can find this tool useful. Indeed, the end-user continuous support feature makes the whole IoT development process effortless. Among the endless number of combinations of mashup components that can be created to program Flink, some are not valid, because they would result in compile-time or run-time errors. To prevent this, aflux warns end users about mashup components that have been wrongly placed.

Of course, graphical programming in general and aflux in particular have some downsides. As most logic is abstracted from end users, they may only access the set of available configuration options. Consequently, there is a trade-off between the level of granularity that they may reach using the GUI and the level of abstraction they can benefit from, i.e. the ease of use. In fact, even if more options were supported, coding would still remain the most flexible and customizable approach to programming.

Besides, not all Flink APIs are supported in the aforementioned aflux plug-in. Although the implemented approach covers a wide number of functionalities that enable targeting many different experiments in a smart city environment, there are still some Flink operations which cannot be performed from aflux. For instance, Flink supports Batch Analytics apart from Stream Analytics, both to perform MapReduce operations and to navigate data by means of an SQL-like API. None of this can currently be

programmed from aflux. Other limitations of the implemented approach are caused by Flink itself, not by the implementation in this thesis. For instance, machine learning operations cannot be applied on the data, because the FlinkML library does not feature a Java API [100].

Limitations
The approach described in this thesis has limitations as well. Some of them originate in mashups, like the trade-off between the level of abstraction (i.e. ease of use) and the flexibility that the mashup tool provides. On the other hand, using just Java limits the range of APIs that can be accessed, for instance Machine Learning algorithms, which are currently only available in Scala. Apart from this, the Flink plug-in does not support all Java APIs, so some jobs (like Batch Analytics jobs) would have to be coded manually.

Fortunately, the conceptual approach that was designed for this thesis (see Chapter 4) is generic and easily expandable. For this reason, the APIs that are not currently supported (e.g. Batch Analytics) could be added to the Flink plug-in at no extra cost, i.e. in a scalable way. The APIs Mapper presented in Section 5.3 actually enables an extremely easy integration of new or changing APIs. This means that, in spite of the fact that only some Flink APIs are currently supported (more specifically, those that can be most useful for the scenario of a Smart City), virtually any Flink API could be effortlessly supported if it were required. Finally, it is also important to highlight that the end-user continuous support can be used with any mashup component in aflux (not just Flink components), so the tool has been notably improved.

Overall Remark
The approach presented in this thesis is generic and easily expandable. For this reason, Flink APIs that are not currently supported could be integrated in aflux at no cost. Besides, the end-user continuous support mechanism improves the overall usability of aflux.


7. Conclusions

IoT applications increasingly need integration with Big Data platforms, to perform analytics and gain insights from the huge amounts of data that are produced every minute by millions of sensors. There are different approaches to Big Data analytics, but Stream Analytics emerges as the most suitable one when it comes to real-time streams of events. However, IoT mashup tools, which are widely used because they ease the process of developing IoT applications, lack the integration of Big Data, especially Stream Analytics. It would therefore be desirable for these tools to support the creation of Stream Analytics programs, in spite of the challenges that such a task entails. This is the point from which this thesis began.

To make this desire a reality, the problem had to be narrowed down before approaching it. This is why, in this thesis, only some Stream Analytics functionalities have been considered. The process of analytics itself is also not carried out in the mashup tool, but in an external software platform instead, and it is this platform that is programmed from the mashup tool. By leveraging a smart approach that enables a generic, expandable, flexible framework, the required functionalities of this software platform (Apache Flink) can be integrated into the mashup tool aflux, while ensuring that the remaining ones could be integrated at no extra cost. This is one of the main achievements of this thesis. A set of data and logic mashup components have been developed, and they can be combined in virtually any way to create an endless number of hybrid mashups.

But, regardless of its complexity and ease of use, offering end users a tool that allows them to program such an elaborate software platform is simply not enough: they also need support on how to do it. Otherwise, no matter how powerful the approach is, it will be worth nothing if nobody can make use of it. For this purpose, a continuous support and feedback mechanism has been integrated into the mashup tool as well, and it has proved to be successful in making the whole development process of IoT applications simple and straightforward.

Just like any engineering project, this thesis has not been devoid of technical problems. In fact, many of them had to be faced and worked around, as has been highlighted throughout this document. These problems range from the disappointing fact of the FIWARE Context Broker not receiving the latest data, to the lack of documentation about aflux, and the challenge of learning how Flink works before it could be modularized into aflux. But they could finally be overcome in the course of this thesis.

All things considered, the questions that were stated at the beginning of this document have been answered, by means of a generic, expandable, flexible approach to create Flink programs graphically and a mechanism to constantly provide end users with feedback on how well they are doing so.

7.1. Main Contributions

In terms of implementation, these are the most relevant outcomes of this thesis:

1. A plug-in for aflux that enables the creation of Flink programs graphically. The source code is fully documented and is available in the GitHub repository of the Flink Plug-in for aflux.

2. A Flink connector that enables the data from SmartSantander to be easily used in Flink applications, by abstracting all the logic required to retrieve those data. The source code is fully documented, is part of an open source project and is available in this Flink fork in GitHub. Of course, this connector will be contributed to the Flink community.

3. A set of improvements in the engine of aflux (both front-end and back-end) that enable end-user continuous support by means of the validation of graphical flows. The source code is fully documented and is available in the aflux GitHub repository.

Apart from these, the conceptual models of how to translate a graphical flow into runnable code and how to enforce semantics among mashup components are also a contribution to research. The former, together with the first two items above, gives a response to the first research question that was presented in Section 1.2. The latter conceptual model, together with the last item above, addresses the second research question.

7.2. Future Work

The limitations presented in Section 6.6 result in several opportunities to improve the approach presented in this thesis:

- Increase the number of Flink APIs that are currently supported in the Flink plug-in for aflux. This includes integrating Batch Analytics as well as the Stream Analytics APIs that are not currently supported, like different notions of time (currently only event time is supported), which could be useful for some scenarios.

- Expand the flexibility of the Flink functionalities that are currently implemented, by integrating more configuration options, e.g. custom triggers in window transformations.

- Improve the end-user continuous support for the sake of a better user experience. For instance, the flow validation could be triggered whenever a mashup component is removed, several flows could be supported regardless of the order in which they are created, etc.

- Devise an unattended mechanism to deploy the generated jobs in a Flink instance. Right now, this has to be done manually. It is not a difficult process, but it would nonetheless make for a more integrated user experience.

A. Appendix

A.1. SmartSantander Connector

Listing A.1: Sample SmartSantander API Response

GET /api/rest/datasets/mediciones HTTP/1.1
Host: datos.santander.es
Accept: application/json

HTTP/1.1 200 OK
Content-Type: application/json
Content-Length:
Server: Jetty(8.0.y.z-SNAPSHOT)

{
  "summary": {
    "items": 465,
    "items_per_page": 1000,
    "pages": 1,
    "current_page": 1
  },
  "resources": [
    {
      "ayto:ocupacion": "48",
      "ayto:medida": "1001",
      "ayto:idsensor": "1001",
      "ayto:intensidad": "120",
      "dc:modified": " T09:39:00Z",
      "dc:identifier": "1001-3b12fe04-7ab7-11e8-aebe a43242",
      "ayto:carga": "50",
      "uri": " e8-aebe a43242.json"
    },
    ...
  ]
}
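The connector retrieves such responses over HTTP; the implementation relies on the Apache HTTP Client [65] and Gson [66]. The following is a minimal, purely illustrative sketch of how the request shown in Listing A.1 could be issued; it is not the actual connector code, and the class name is an assumption.

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class FetchExample {
    public static void main(String[] args) throws Exception {
        // Issue the same request as in Listing A.1 and read the JSON body
        try (CloseableHttpClient client = HttpClients.createDefault()) {
            HttpGet request = new HttpGet("http://datos.santander.es/api/rest/datasets/mediciones");
            request.addHeader("Accept", "application/json");
            try (CloseableHttpResponse response = client.execute(request)) {
                String json = EntityUtils.toString(response.getEntity());
                System.out.println(json); // the document shown in Listing A.1
            }
        }
    }
}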

Table A.1.: Structure of the Traffic Dataset

Attribute Name | Attribute Type | Description
ayto:ocupacion | Number (int) | It stands for occupation. Time percentage that the transit loop is occupied by a vehicle.
ayto:carga | Number (int) | It stands for charge. Estimation of congestion based on occupation and intensity (on a 100-basis).
ayto:intensidad | Number (int) | It stands for intensity. Number of counted vehicles expanded to vehicles per hour (vph).
ayto:idsensor | Number (int) | Unique identifier of the sensor. To be matched with the locations resource.
dc:modified | Text | Date and time of the measurement, in ISO 8601 format [101].
dc:identifier | Text | Unique identifier of the measurement.
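To give an idea of how these attributes map onto the connector's model (see Fig. A.1), the following is a minimal sketch of a traffic observation class annotated for Gson deserialization. Class, field and getter names are illustrative; the actual TrafficObservation class used by the connector may differ.

import com.google.gson.annotations.SerializedName;

// Illustrative POJO mapping the attributes of Table A.1 (not the actual implementation class).
public class TrafficObservation {

    @SerializedName("ayto:ocupacion")
    private int occupation;   // time percentage the transit loop is occupied

    @SerializedName("ayto:carga")
    private int charge;       // congestion estimation (0-100)

    @SerializedName("ayto:intensidad")
    private int intensity;    // vehicles per hour

    @SerializedName("ayto:idsensor")
    private int sensorId;

    @SerializedName("dc:modified")
    private String timestamp; // ISO 8601 date and time

    public int getCharge() {
        return charge;
    }

    // further getters omitted for brevity
}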

Table A.2.: Structure of the Environment Dataset

Attribute Name | Attribute Type | Description
ayto:type | Text | Type of the measurement. It can be either WeatherObserved or NoiseLevelObserved.
ayto:noise | Number (decimal) | Noise level specified in dB.
ayto:temperature | Number (decimal) | Temperature specified in °C.
ayto:light | Number (decimal) | Light intensity specified in lm.
ayto:latitude | Number (decimal) | Latitude of the sensor location.
ayto:longitude | Number (decimal) | Longitude of the sensor location.
ayto:battery | Number (decimal) | Remaining battery in the sensor, expressed as a percentage.
dc:modified | Text | Date and time of the measurement, in ISO 8601 format [101].
dc:identifier | Number (int) | Unique identifier of the sensor.

Table A.3.: Structure of the Air Quality Dataset

Attribute Name | Attribute Type | Description
ayto:type | Text | Type of the measurement. It can only be AirQualityObserved.
ayto:no2 | Number (decimal) | Level of NO2 specified in µg/m³.
ayto:co | Number (decimal) | Level of CO specified in mg/m³.
ayto:ozone | Number (decimal) | Level of ozone specified in µg/m³.
ayto:temperature | Number (decimal) | Temperature specified in °C.
ayto:latitude | Number (decimal) | Latitude of the sensor location.
ayto:longitude | Number (decimal) | Longitude of the sensor location.
dc:modified | Number (decimal) | Date and time of the measurement, in ISO 8601 format [101].
dc:identifier | Number (decimal) | Unique identifier of the sensor.

Figure A.1.: Model used for the SmartSantander Connector

Figure A.2.: The SmartSantander Connector

A.2. Mashup Components

Listing A.2: Generating an Anonymous Class with JavaPoet

// retrieve ClassName of types that will be used
ClassName mapFunctionClass = ClassName.get("org.apache.flink.api.common.functions", "MapFunction");
ClassName environmentObservationClass = ClassName.get("org.apache.flink.streaming.connectors.smartsantander.model", "EnvironmentObservation");
ClassName doubleClass = ClassName.get(Double.class);

MethodSpec.Builder mapMethodBuilder = MethodSpec.methodBuilder("map")
    .addAnnotation(Override.class)
    .addModifiers(Modifier.PUBLIC)
    .addParameter(environmentObservationClass, "observation")
    .addException(Exception.class)
    .returns(doubleClass);

TypeSpec mapFunction = TypeSpec.anonymousClassBuilder("")
    .addSuperinterface(ParameterizedTypeName.get(mapFunctionClass, environmentObservationClass, doubleClass))
    .addMethod(mapMethodBuilder.build())
    .build();

// Assuming we have a CodeBlock.Builder named "code" in our scope
code.addStatement("$T mapFunction = $L", mapFunctionClass, mapFunction);

Listing A.3: Output of Listing A.2

MapFunction mapFunction = new MapFunction<EnvironmentObservation, Double>() {
    @Override
    public Double map(EnvironmentObservation observation) throws Exception {
        return (Double) observation.getTemperature();
    }
};
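For completeness, the statements accumulated in such a CodeBlock.Builder eventually need to be wrapped in a full class and written to disk, which is the task of the JavaCodeGenerator invoked in Listing A.4. The following is a minimal, purely illustrative sketch of how this step could be done with JavaPoet; the actual JavaCodeGenerator, as well as the class and package names used below, may differ.

import com.squareup.javapoet.CodeBlock;
import com.squareup.javapoet.JavaFile;
import com.squareup.javapoet.MethodSpec;
import com.squareup.javapoet.TypeSpec;
import java.nio.file.Paths;
import javax.lang.model.element.Modifier;

public class CodeGenSketch {
    public static void writeJobClass(CodeBlock.Builder code, String outputDir) throws Exception {
        // Wrap the generated statements in a main() method
        MethodSpec main = MethodSpec.methodBuilder("main")
                .addModifiers(Modifier.PUBLIC, Modifier.STATIC)
                .returns(void.class)
                .addParameter(String[].class, "args")
                .addException(Exception.class)
                .addCode(code.build())
                .build();

        // Wrap the method in a class and write the .java file into the Maven project
        TypeSpec jobClass = TypeSpec.classBuilder("GeneratedFlinkJob")
                .addModifiers(Modifier.PUBLIC)
                .addMethod(main)
                .build();

        JavaFile.builder("org.example.generated", jobClass)
                .build()
                .writeTo(Paths.get(outputDir));
    }
}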

Table A.4.: Mashup Components in the Flink Plug-in for aflux

Name | Associated Classes | Description
Begin Job | Node01EnvironmentSetup, EnvironmentSetUpActor | Main configuration of the execution environment.
SmartSntndr Data | Node02SmartSantanderDataSource, SmartSantanderDataSourceActor | Configure a data source from the SmartSantander API and create a data stream out of it.
GPS Filter | Node03TransformationFilter, TransformationFilterActor | Apply a filter transformation on a data stream based on a specific location (latitude, longitude, radius).
Select | Node04TransformationMap, TransformationMapActor | Apply a map transformation on a data stream to pick one of its attributes.
Window | Node05TransformationWindow, TransformationWindowActor | Apply a window transformation on a data stream.
Window Operation | Node06TransformationWindowOperation, TransformationWindowOperationActor | Apply an operation on the elements of a windowed data stream.
Output Result | Node07OutputResult, OutputResultActor | Print out a data stream.
CEP Begin | Node08CepPatternBegin, CepPatternSequenceBeginActor | Begin the definition of a pattern sequence for CEP.
CEP Patt. New | Node10CepPatternAppend, CepPatternAppendActor | Append a new pattern to the current pattern sequence.
CEP Add Condition | Node09CepPatternCondition, CepPatternConditionActor | Add a new condition to the current pattern.
CEP End | Node11CepPatternEnd, CepPatternSequenceEndActor | Finish the definition of a pattern sequence and configure behavior on match.
End Job | Node99ExecuteAndGenerateJob, ExecuteAndGenerateJobActor | Finish the definition of the Flink job.

Listing A.4: ExecuteAndGenerateJobActor.java

public class ExecuteAndGenerateJobActor extends AbstractAFluxActor {

    // ... rest of the class ...

    @Override
    protected void runCore(Object message) throws Exception {

        // Validate and cast message
        FlinkFlowMessage msg;
        try {
            msg = FlinkFlowMessage.fromRawMessage(message);
        } catch (IllegalArgumentException e) {
            this.sendOutput("error when receiving message from previous node.");
            return;
        }
        CodeBlock.Builder code = msg.getCode();
        TypeName inputType = msg.getCurrentType();
        String inputVariableName = msg.getCurrentDataStreamVariableName();

        // Add code: execute job
        this.sendOutput("generating code for: job execution");
        code.addStatement("$L.$L()",
                EnvironmentSetUpActor.GENERATED_CODE_VARIABLE_ENV,
                API.getMethodName("StreamExecutionEnvironment.execute"));

        // Compute paths
        String outputPath = System.getProperty("user.home"); // to extract template
        String projectPath = String.join(File.separator, outputPath, MavenUtils.FLINK_TEMPLATE_PATH); // to access POM
        String codePath = String.join(File.separator, projectPath, "src", "main", "java"); // to output code

        extractTemplate(outputPath); // Extract the project template

        // Generate class code
        JavaCodeGenerator codeGen = new JavaCodeGenerator(code, codePath);
        codeGen.generateJavaClassFile();
        this.sendOutput("generating final Flink code");

        // Generate packaged jar
        this.sendOutput("generating final packaged Flink job");
        String result = "";
        try {
            result = MavenUtils.runMavenCommand("clean package", projectPath);
        } catch (Exception e) {
            this.sendOutput("an error occurred when building with Maven");
        } finally {
            this.sendOutput(result);
        }
    }
}
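The MavenUtils.runMavenCommand call above packages the generated project into a runnable jar; the Apache Maven Invoker library [90] is the tool cited for this purpose. The following is a minimal, purely illustrative sketch of how such a helper could be written; the class, method names and the Maven installation path are assumptions, not the actual implementation.

import java.io.File;
import java.util.Arrays;
import org.apache.maven.shared.invoker.DefaultInvocationRequest;
import org.apache.maven.shared.invoker.DefaultInvoker;
import org.apache.maven.shared.invoker.InvocationRequest;
import org.apache.maven.shared.invoker.InvocationResult;
import org.apache.maven.shared.invoker.Invoker;

public class MavenInvokerSketch {

    // Runs the given Maven goals (e.g. "clean package") on the project located at projectPath.
    public static int runMavenGoals(String goals, String projectPath) throws Exception {
        InvocationRequest request = new DefaultInvocationRequest();
        request.setPomFile(new File(projectPath, "pom.xml"));
        request.setGoals(Arrays.asList(goals.split(" ")));

        Invoker invoker = new DefaultInvoker();
        // Maven Invoker needs to know where Maven is installed (path below is an example)
        invoker.setMavenHome(new File("/usr/share/maven"));

        InvocationResult result = invoker.execute(request);
        return result.getExitCode(); // 0 means the jar was built successfully
    }
}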

Listing A.5: Basic Example with FlinkCEP

DataStream<Event> input = ...;

Pattern<Event, ?> pattern = Pattern.<Event>begin("start").where(
    new SimpleCondition<Event>() {
        @Override
        public boolean filter(Event event) {
            return event.getId() == 42;
        }
    }
).next("middle").subtype(SubEvent.class).where(
    new SimpleCondition<SubEvent>() {
        @Override
        public boolean filter(SubEvent subEvent) {
            return subEvent.getVolume() >= 10.0;
        }
    }
).followedBy("end").where(
    new SimpleCondition<Event>() {
        @Override
        public boolean filter(Event event) {
            return event.getName().equals("end");
        }
    }
);

PatternStream<Event> patternStream = CEP.pattern(input, pattern);

DataStream<Alert> result = patternStream.select(
    new PatternSelectFunction<Event, Alert>() {
        @Override
        public Alert select(Map<String, List<Event>> pattern) throws Exception {
            return createAlertFrom(pattern);
        }
    }
);

A.3. Flink API Mapper

Figure A.3.: The Flink API Mapper


140 Bibliography [1] Min Chen, Shiwen Mao, and Yunhao Liu. Big Data: A Survey. In: Mobile Networks and Applications 19.2 (Apr. 2014), pp issn: doi: /s [2] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In: Commun. ACM 51.1 (Jan. 2008), pp issn: doi: / [3] Muhammad Hussain Iqbal and Tariq Rahim Soomro. Big data analysis: Apache storm perspective. In: International journal of computer trends and technology 19.1 (2015), pp [4] Patricio Córdova. Analysis of real time stream processing systems considering latency. In: University of Toronto (2015). [5] A. S. Chhabra et al. Prediction for Big Data and IoT in In: 2017 International Conference on Infocom Technologies and Unmanned Systems (Trends and Future Directions) (ICTUS). Dec. 2017, pp doi: /ICTUS [6] M. Marjani et al. Big IoT Data Analytics: Architecture, Opportunities, and Open Research Challenges. In: IEEE Access 5 (2017), pp doi: / ACCESS [7] Tanmaya Mahapatra, Ilias Gerostathopoulos, and Christian Prehofer. Towards Integration of Big Data Analytics in Internet of Things Mashup Tools. In: Proceedings of the Seventh International Workshop on the Web of Things. WoT 16. Stuttgart, Germany: ACM, 2016, pp isbn: doi: / [8] Gartner. Gartner Identifies Three Megatrends That Will Drive Digital Business Into the Next Decade. Aug url: (visited on 06/2018). [9] Fabio Casati et al. Developing mashup tools for end-users: on the importance of the application domain. In: International Journal of Next-Generation Computing 3.2 (2012). [10] European Comission. SmartSantander url: project/rcn/95933_en.html (visited on 06/2018). [11] José M. Hernández-Muñoz et al. Smart Cities at the Forefront of the Future Internet. In: The Future Internet. Berlin, Heidelberg: Springer Berlin Heidelberg, 2011, pp

141 Bibliography [12] Jeffrey Dean and Sanjay Ghemawat. MapReduce: a flexible data processing tool. In: Communications of the ACM 53.1 (2010), pp [13] Saeed Shahrivari. Beyond batch processing: towards real-time and streaming big data. In: Computers 3.4 (2014), pp [14] Yongrui Qin et al. When things matter: A survey on data-centric internet of things. In: Journal of Network and Computer Applications 64 (2016), pp [15] G. M. D silva et al. Real-time processing of IoT events with historic data using Apache Kafka and Apache Spark with dashing framework. In: nd IEEE International Conference on Recent Trends in Electronics, Information Communication Technology (RTEICT). May 2017, pp doi: / RTEICT [16] Michael Stonebraker, Uǧur Çetintemel, and Stan Zdonik. The 8 requirements of real-time stream processing. In: ACM Sigmod Record 34.4 (2005), pp [17] Ellen Friedman and Kostas Tzoumas. Introduction to Apache Flink. O Reilly, Sept [18] The Apache Software Foundation. What is Apache Flink? Flink Applications url: https : / / flink. apache. org / flink - applications. html (visited on 06/2018). [19] Nathan Marz and James Warren. Big Data: Principles and best practices of scalable real-time data systems. New York; Manning Publications Co., [20] Mariam Kiran et al. Lambda architecture for cost-effective batch and speed big data processing. In: Big Data (Big Data), 2015 IEEE International Conference on. IEEE. 2015, pp [21] Jay Kreps. Questioning the Lambda Architecture.(2014). In: O Reilly Data Tools 2 (2014), p. 25. [22] Omer Berat Sezer et al. An extended iot framework with semantics, big data, and analytics. In: Big Data (Big Data), 2016 IEEE International Conference on. IEEE. 2016, pp [23] Nilamadhab Mishra, Chung-Chih Lin, and Hsien-Tsung Chang. A Cognitive Adopted Framework for IoT Big-Data Management and Knowledge Discovery Prospective. In: International Journal of Distributed Sensor Networks (2015), p doi: /2015/ eprint: [24] DMC Dissanayake and KPN Jayasena. A cloud platform for big IoT data analytics by combining batch and stream processing technologies. In: Information Technology Conference (NITC), 2017 National. IEEE. 2017, pp [25] Shusen Yang. IoT Stream Processing and Analytics in the Fog. In: IEEE Communications Magazine 55.8 (2017), pp

142 Bibliography [26] J. Jin et al. An Information Framework for Creating a Smart City Through Internet of Things. In: IEEE Internet of Things Journal 1.2 (Apr. 2014), pp issn: doi: /JIOT [27] Q. Wu et al. Cognitive Internet of Things: A New Paradigm Beyond Connection. In: IEEE Internet of Things Journal 1.2 (Apr. 2014), pp issn: doi: /JIOT [28] A. Katsifodimos and S. Schelter. Apache Flink: Stream Analytics at Scale. In: 2016 IEEE International Conference on Cloud Engineering Workshop (IC2EW). Apr. 2016, pp doi: /IC2EW [29] The Apache Software Foundation. Apache Flink Documentation, v url: (visited on 06/2018). [30] The Apache Software Foundation. Event Time. Flink DataStream API Programming Guide, v url: docsrelease-1.5/dev/event_time.html (visited on 06/2018). [31] The Apache Software Foundation. Windows. Flink DataStream API Programming Guide, v url: docsrelease-1.5/dev/stream/operators/windows.html (visited on 06/2018). [32] The Apache Software Foundation. Dataflow Programming Model, v url: https : / / ci. apache. org / projects / flink / flink - docs - release / concepts/programming-model.html (visited on 06/2018). [33] The Apache Software Foundation. Apache Flink R - Stateful Computations over Data Streams url: (visited on 06/2018). [34] S. Chintapalli et al. Benchmarking Streaming Computation Engines: Storm, Flink and Spark Streaming. In: 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). May 2016, pp doi: /IPDPSW [35] O. C. Marcu et al. Spark Versus Flink: Understanding Performance in Big Data Analytics Frameworks. In: 2016 IEEE International Conference on Cluster Computing (CLUSTER). Sept. 2016, pp doi: /CLUSTER [36] Florian Daniel and Maristella Matera. Mashups. Concepts, Models and Architecture. Springer, [37] Jin Yu et al. Understanding mashup development. In: IEEE Internet computing 5 (2008), pp [38] M. Q. Saleem, J. Jaafar, and M. F. Hassan. Model driven software development: An overview. In: 2014 International Conference on Computer and Information Sciences (ICCOINS). June 2014, pp doi: /ICCOINS

143 Bibliography [39] Yoon-Seop Chang et al. Study on mobile mashup webapp development tools for different devices and user groups. In: The International Conference on Information Networking 2014 (ICOIN2014). Feb. 2014, pp doi: /ICOIN [40] E. Lee and Hyung-Joo Joo. Developing lightweight context-aware service mashup applications. In: th International Conference on Advanced Communications Technology (ICACT). Jan. 2013, pp [41] A. Bozzon et al. A Conceptual Modeling Approach to Business Service Mashup Development. In: 2009 IEEE International Conference on Web Services. July 2009, pp doi: /ICWS [42] Tanmaya Mahapatra and Christian Prehofer. Service Mashups and Developer Support. In: Digital Mobility Platforms and Ecosystems (2016), p. 48. [43] Sehyeon Heo et al. IoT-MAP: IoT mashup application platform for the flexible IoT ecosystem. In: Internet of Things (IOT), th International Conference on the. IEEE. 2015, pp [44] F. Pramudianto et al. IoT Link: An Internet of Things Prototyping Toolkit. In: 2014 IEEE 11th Intl Conf on Ubiquitous Intelligence and Computing and 2014 IEEE 11th Intl Conf on Autonomic and Trusted Computing and 2014 IEEE 14th Intl Conf on Scalable Computing and Communications and Its Associated Workshops. Dec. 2014, pp doi: /UIC-ATC-ScalCom [45] B. Cheng et al. Lightweight Service Mashup Middleware with REST Style Architecture for IoT Applications. In: IEEE Transactions on Network and Service Management (2018), pp issn: doi: /TNSM [46] Project Consortium TUM Living Lab Connected Mobility. Digital Mobility Platforms and Ecosystems. Tech. rep. München: Software Engineering for Business Information Systems (sebis), July [47] Tanmaya Mahapatra et al. Stream Analytics in IoT Mashup Tools. In: Proceedings of the Visual Languages and Human Centric Computing. VL/HCC. Lisbon, Portugal, 2018, In Press. [48] Tanmaya Mahapatra, Christian Prehofer, and Manfred Broy. Service Mashups and Developer Support (Poster). In: Technical University Munich, Living Lab Connected Mobility Project Annual Event 2017 (2017). [49] Tanmaya Mahapatra and Christian Prehofer. Service Mashups and Developer Support. Sept. 18, [50] Lightbend, Inc. Actor Systems: Akka url: doc. akka. io/ docs/ akka/2.5/general/actor-systems.html (visited on 06/2018). 116

144 Bibliography [51] M. Daneva and B. Lazarov. Requirements for smart cities: Results from a systematic review of literature. In: th International Conference on Research Challenges in Information Science (RCIS). May 2018, pp doi: /RCIS [52] SmartSantander Project, FP7-ICT url: eu (visited on 06/2018). [53] IEEE Std : IEEE Standard for Low-Rate Wireless Networks. Standard. Institute of Electrical and Electronics Engineers, [54] The FIWARE Foundation. FIWARE Lab Context Management Platform. w016. url: FIWARE_Lab_Context_Management_Platform (visited on 06/2018). [55] SmartSantander Open Data access using FI-WARE G.E. [ORION]. White Paper. University of Cantabria. [56] Touk. Nussknacker. Streaming Processes Diagrams. url: nussknacker/ (visited on 06/2018). [57] Akka. How the Actor Model Meets the Needs of Modern, Distributed Systems url: (visited on 06/2018). [58] Wikipedia contributors. Directed acyclic graph Wikipedia, The Free Encyclopedia url: (visited on 06/2018). [59] Wikipedia contributors. Abstract Syntax Tree Wikipedia, The Free Encyclopedia url: (visited on 06/2018). [60] The Apache Software Foundation. Flink DataStream API Programming Guide, v url: ci. apache. org/ projects/ flink/ flink- docs- release- 1.5/dev/datastream_api.html (visited on 06/2018). [61] The FIWARE Foundation. Publish/Subscribe Context Broker - Orion Context Broker url: (visited on 06/2018). [62] The FIWARE Foundtion. Orion Context Broker url: https : / / fiware - orion.readthedocs.io/en/master/index.html (visited on 06/2018). [63] Santander City Council. Santander Open Data - REST API Documentation url: (visited on 06/2018). [64] The Apache Software Foundation. Apache Lucene - Query Parser Syntax url: (visited on 06/2018). [65] Oleg Kalnichevski, Jonathan Moore, and Jilles van Gurp. Apache HTTP Client. Reference Guide. The Apache Software Foundation, Jan

145 Bibliography [66] Google. Gson, a Java serialization/deserialization library to convert Java Objects into JSON and back. url: md (visited on 06/2018). [67] Jakob Jenkov. Java BlockingQueue. June url: com/java-util-concurrent/blockingqueue.html (visited on 06/2018). [68] ESRI Shapefile Technical Description. White Paper. Environmental Systems Research Institute, Inc., July [69] The Apache Software Foundation. Apache Commons CSV, version Sept url: csv/user- guide.html (visited on 06/2018). [70] Oracle. Java Type Erasure. url: java/generics/erasure.html (visited on 06/2018). [71] Zsolt Török. How to get a class instance of generics type T. Aug url: https: //stackoverflow.com/questions/ /how- to- get- a- class- instanceof-generics-type-t (visited on 06/2018). [72] Square, Inc. JavaPoet, a Java API for generating.java source files. May url: (visited on 06/2018). [73] Square, Inc. JavaWriter, a utility class which aids in generating Java source files. Feb url: md (visited on 06/2018). [74] The Apache Software Foundation. Apache Commons Lang, version 3.7. Nov url: (visited on 06/2018). [75] The Apache Software Foundation. Flink Basic API Concepts, v url: https: / / ci. apache. org / projects / flink / flink - docs - release / dev / api _ concepts.html (visited on 06/2018). [76] The Apache Software Foundation. Pre-defined Timestamp Extractors & Watermark Emitters. Flink DataStream API Programming Guide, v url: https : / / ci. apache. org / projects / flink / flink - docs - release / dev / event _ timestamp_extractors.html (visited on 06/2018). [77] Tanmay Deshpande. Learning Apache Flink. Packt Publishing, Feb [78] The Apache Software Foundation. DataStream Transformations. Flink DataStream API Programming Guide, v url: ci. apache. org/ projects/ flink/flink-docs-release-1.5/dev/stream/operators/index.html (visited on 06/2018). [79] Ramesh Syangtan. React Location Picker. May url: rameshsyn/react-location-picker/readme.md (visited on 06/2018). 118

146 Bibliography [80] Wikipedia contributors. Haversine Formula Wikipedia, The Free Encyclopedia url: (visited on 06/2018). [81] David George and Neeme Praks. Calculating distance between two points, using latitude longitude, what am I doing wrong? Apr url: com/questions/ /calculating-distance-between-two-points-usinglatitude-longitude-what-am-i-doi (visited on 06/2018). [82] The Apache Software Foundation. State. Flink DataStream API Programming Guide, v url: https : / / ci. apache. org / projects / flink / flink - docs - release- 1. 5/ dev/ stream/ state/ state. html# keyed- state- and- operatorstate (visited on 06/2018). [83] A. Marques da Silva Cardoso et al. Poster Abstract: Real-Time DDoS Detection Based on Complex Event Processing for IoT. In: 2018 IEEE/ACM Third International Conference on Internet-of-Things Design and Implementation (IoTDI). Apr. 2018, pp doi: /IoTDI [84] A. Alakari, K. F. Li, and F. Gebali. Complex event processing enrichment: Motivations and challenges. In: 2017 IEEE Pacific Rim Conference on Communications, Computers and Signal Processing (PACRIM). Aug. 2017, pp doi: /PACRIM [85] The Apache Software Foundation. FlinkCEP. Complex event processing for Flink, v url: https : / / ci. apache. org / projects / flink / flink - docs - release-1.5/dev/libs/cep.htmls (visited on 06/2018). [86] S. Choochotkaew et al. EdgeCEP: Fully-Distributed Complex Event Processing on IoT Edges. In: th International Conference on Distributed Computing in Sensor Systems (DCOSS). June 2017, pp doi: /DCOSS [87] R. Raj et al. Real time complex event processing and analytics for smart building. In: 2017 Conference on Information and Communication Technology (CICT). Nov. 2017, pp doi: /INFOCOMTECH [88] The Apache Software Foundation. Flink Project Template for Java, v url: https : / / ci. apache. org / projects / flink / flink - docs - release / quickstart/java_api_quickstart.html (visited on 06/2018). [89] Juan José Zapico. Extract the Contents of ZIP/JAR Files Programmatically. Aug url: (visited on 06/2018). [90] The Apache Software Foundation. Apache Maven Invoker, version May url: (visited on 06/2018). [91] Mark A. Overton. The IDAR Graph. In: Queue 15.2 (Apr. 2017), 20:29 20:48. issn: doi: /

147 Bibliography [92] Wikipedia contributors. Singleton Pattern Wikipedia, The Free Encyclopedia url: (visited on 06/2018). [93] Nicholas Smith, Danny van Bruggen, and Federico Tomassetti. JavaParser: Visited. Lean Publisher, Apr [94] Danny van Bruggen. JavaParser, for processing Java code url: http : / / javaparser.org/ (visited on 06/2018). [95] Thomas Kuhn and Olivier Thomann. Eclipse Abstract Syntax Tree url: www. eclipse. org/ articles/ Article- JavaCodeManipulation_ AST/ (visited on 06/2018). [96] Wikipedia contributors. Many-to-many (data model) Wikipedia, The Free Encyclopedia url: to- many_(data_ model) (visited on 06/2018). [97] Oracle. Bounded Type Parameters. url: https : / / docs. oracle. com / javase / tutorial/java/generics/bounded.html (visited on 06/2018). [98] Scott Ambler. UML 2 Use Case Diagrams: An Agile Introduction url: (visited on 06/2018). [99] The Apache Software Foundation. Local Flink Cluster. Flink Quickstart, v url: docs- release- 1.5/ quickstart/setup_quickstart.html#start- a- local- flink- cluster (visited on 06/2018). [100] The Apache Software Foundation. FlinkML. Machine Learning for Flink, v url: libs/ml/ (visited on 06/2018). [101] ISO 8601:2004: Data elements and interchange formats. Information interchange. Representation of dates and times. Standard. International Organization for Standardization, Dec


More information

Optimizing the Cellular Network Planning Process for In-Building Coverage using Simulation

Optimizing the Cellular Network Planning Process for In-Building Coverage using Simulation Document downloaded from http://www.elsevier.es, day 17/11/217. This copy is for personal use. Any transmission of this document by any media or format is strictly prohibited. Optimizing the Cellular Network

More information

From the artwork to the demo artwork. Case Study on the conservation and degradation of new media artworks

From the artwork to the demo artwork. Case Study on the conservation and degradation of new media artworks Ge-conservación Conservação Conservation From the artwork to the demo artwork. Case Study on the conservation and degradation of new media artworks Diego Mellado Martínez, Lino García Morales Abstract:

More information

PROYECTO FIN DE CARRERA

PROYECTO FIN DE CARRERA UNIVERSIDAD AUTÓNOMA DE MADRID ESCUELA POLITECNICA SUPERIOR PROYECTO FIN DE CARRERA TRANSMISSION OF LAYERED VIDEO CODING USING MEDIA AWARE FEC Nicolás Díez Risueño Junio 2011 TRANSMISSION OF LAYERED VIDEO

More information

An Introduction to Big Data Formats

An Introduction to Big Data Formats Introduction to Big Data Formats 1 An Introduction to Big Data Formats Understanding Avro, Parquet, and ORC WHITE PAPER Introduction to Big Data Formats 2 TABLE OF TABLE OF CONTENTS CONTENTS INTRODUCTION

More information

The SMACK Stack: Spark*, Mesos*, Akka, Cassandra*, Kafka* Elizabeth K. Dublin Apache Kafka Meetup, 30 August 2017.

The SMACK Stack: Spark*, Mesos*, Akka, Cassandra*, Kafka* Elizabeth K. Dublin Apache Kafka Meetup, 30 August 2017. Dublin Apache Kafka Meetup, 30 August 2017 The SMACK Stack: Spark*, Mesos*, Akka, Cassandra*, Kafka* Elizabeth K. Joseph @pleia2 * ASF projects 1 Elizabeth K. Joseph, Developer Advocate Developer Advocate

More information

Databases 2 (VU) ( / )

Databases 2 (VU) ( / ) Databases 2 (VU) (706.711 / 707.030) MapReduce (Part 3) Mark Kröll ISDS, TU Graz Nov. 27, 2017 Mark Kröll (ISDS, TU Graz) MapReduce Nov. 27, 2017 1 / 42 Outline 1 Problems Suited for Map-Reduce 2 MapReduce:

More information

DM6. User Guide English ( 3 10 ) Guía del usuario Español ( ) Appendix English ( 13 ) DRUM MODULE

DM6. User Guide English ( 3 10 ) Guía del usuario Español ( ) Appendix English ( 13 ) DRUM MODULE DM6 DRUM MODULE User Guide English ( 3 10 ) Guía del usuario Español ( 11 12 ) Appendix English ( 13 ) 2 User Guide (English) Support For the latest information about this product (system requirements,

More information

REPORTE DE CASO Rev. Investig. Altoandin. 2016; Vol 18 Nº 4:

REPORTE DE CASO Rev. Investig. Altoandin. 2016; Vol 18 Nº 4: ARTICLE INFO Article received 30-09-2016 Article accepted 12-12-2016 On line: 20-12-2016 PALABRAS CLAVES: REPORTE DE CASO Rev. Investig. Altoandin. 2016; Vol 18 Nº 4: 475-482 http:huajsapata.unap.edu.pe/ria

More information

Embedded Technosolutions

Embedded Technosolutions Hadoop Big Data An Important technology in IT Sector Hadoop - Big Data Oerie 90% of the worlds data was generated in the last few years. Due to the advent of new technologies, devices, and communication

More information

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context 1 Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes

More information

With its speed and variety of information treated would Big Data end Data warehousing?*

With its speed and variety of information treated would Big Data end Data warehousing?* With its speed and variety of information treated would Big Data end Data warehousing?* Kamagate Azoumana** Universidad Simón Bolívar, Barranquilla Recibido: 4 de septiembre de 2013 Aceptado: 27 de noviembre

More information

PROGRAMA PARA LA AUTOMATIZACIÓN DEL PROCESO DE INTEGRACIÓN ENTRE STERLING OMS Y UN SISTEMA EXTERNO

PROGRAMA PARA LA AUTOMATIZACIÓN DEL PROCESO DE INTEGRACIÓN ENTRE STERLING OMS Y UN SISTEMA EXTERNO ESCUELA TÉCNICA SUPERIOR DE INGENIERÍA (ICAI) GRADO EN INGENIERÍA ELECTROMECÁNICA PROGRAMA PARA LA AUTOMATIZACIÓN DEL PROCESO DE INTEGRACIÓN ENTRE STERLING OMS Y UN SISTEMA EXTERNO Autor: Ignacio Pérez

More information

AUTHOR: DIRECTOR: DATE:

AUTHOR: DIRECTOR: DATE: MASTER THESIS TITLE: Advanced personal social network API for third-party mobile applications MASTER DEGREE: Master in Science in Telecommunication Engineering & Management AUTHOR: Alberto Carlos Toro

More information

Europeana Core Service Platform

Europeana Core Service Platform Europeana Core Service Platform DELIVERABLE D7.1: Strategic Development Plan, Architectural Planning Revision Final Date of submission 30 October 2015 Author(s) Marcin Werla, PSNC Pavel Kats, Europeana

More information

IMECAF México, S.C. Instituto Mexicano de Contabilidad, Administración y Finanzas. Nombre del Curso WINDOWS SERVER Objetivo

IMECAF México, S.C. Instituto Mexicano de Contabilidad, Administración y Finanzas. Nombre del Curso WINDOWS SERVER Objetivo IMECAF México, S.C. Instituto Mexicano de Contabilidad, Administración y Finanzas Nombre del Curso WINDOWS SERVER 2016 Objetivo Este curso prepara a los estudiantes para el examen de certificación TestOut

More information

Anexos. Diseño y construcción de un puente grúa automatizado de precisión

Anexos. Diseño y construcción de un puente grúa automatizado de precisión Anexos Diseño y construcción de un puente grúa automatizado de precisión Nombre: Daniel Andrade García Especialidad: Ingeniería Electrónica Industrial y Automática Tutor: Inmaculada Martínez Teixidor Cotutor:

More information

b) Use one of your methods to calculate the area of figure c.

b) Use one of your methods to calculate the area of figure c. Task 9: 1. Look at the polygons below. a) Describe at least three different methods for calculating the areas of these polygons. While each method does not necessarily have to work for all three figures,

More information

GRADO EN INGENIERÍA DE TECNOLOGÍAS Y SERVICIOS DE TELECOMUNICACIÓN TRABAJO FIN DE GRADO

GRADO EN INGENIERÍA DE TECNOLOGÍAS Y SERVICIOS DE TELECOMUNICACIÓN TRABAJO FIN DE GRADO GRADO EN INGENIERÍA DE TECNOLOGÍAS Y SERVICIOS DE TELECOMUNICACIÓN TRABAJO FIN DE GRADO DEVELOPMENT OF A TASK AUTOMATION PLATFORM BASED ON SEMANTIC MULTIEVENT-DRIVEN RULES CARLOS MORO GARCÍA 2017 TRABAJO

More information

Banda Ancha: Tendencia de la Evolución hacia la Convergencia de Servicios XXX Congreso Annual: Grupo ICE en el Contexto Político 2018

Banda Ancha: Tendencia de la Evolución hacia la Convergencia de Servicios XXX Congreso Annual: Grupo ICE en el Contexto Político 2018 Banda Ancha: Tendencia de la Evolución hacia la Convergencia de Servicios XXX Congreso Annual: Grupo ICE en el Contexto Político 2018 Rafael Cobos Global Service Provider, Latin America Diciembre, 2017

More information

SAP HANA 2 0 INSTALLATION AND ADMINISTRATION

SAP HANA 2 0 INSTALLATION AND ADMINISTRATION page 1 / 5 page 2 / 5 sap hana 2 0 pdf What is SAP HANA? The in-memory computing platform that lets you collect, store, and process high volumes of operational and transactional data in real time. What

More information

ECOPETROL BARRANCABERJEJA. INTERFACES AL SERVIDOR PI:

ECOPETROL BARRANCABERJEJA. INTERFACES AL SERVIDOR PI: ECOPETROL BARRANCABERJEJA. INTERFACES AL SERVIDOR PI: Este documento fue creado para apoyar la instalación de la(s) estación(es) que contiene(n) la(s) interface(s) al sistema PI de ECOPETROL-Barrancabermeja.

More information

ACCESO UNIFORME A RECURSOS DE E-CIENCIA PARA LA EXPLOTACIÓN DE E-INFRAESTRUCTURAS

ACCESO UNIFORME A RECURSOS DE E-CIENCIA PARA LA EXPLOTACIÓN DE E-INFRAESTRUCTURAS UNIVERSIDAD DE CANTABRIA INSTITUTO DE FÍSICA DE CANTABRIA (CSIC-UC) y LABORATORIO NACIONAL DE FUSIÓN (CIEMAT) ACCESO UNIFORME A RECURSOS DE E-CIENCIA PARA LA EXPLOTACIÓN DE E-INFRAESTRUCTURAS TESIS DOCTORAL

More information

DOC / 275 DE JAVA USER GUIDE EBOOK

DOC / 275 DE JAVA USER GUIDE EBOOK 11 March, 2018 DOC / 275 DE JAVA USER GUIDE EBOOK Document Filetype: PDF 250.3 KB 0 DOC / 275 DE JAVA USER GUIDE EBOOK Read the User guide - This will open the official. Manual prctico para aprender a

More information

Manual De Adobe Photoshop Cs3 En Espanol File Type

Manual De Adobe Photoshop Cs3 En Espanol File Type We have made it easy for you to find a PDF Ebooks without any digging. And by having access to our ebooks online or by storing it on your computer, you have convenient answers with manual de adobe photoshop

More information

Garantía y Seguridad en Sistemas y Redes

Garantía y Seguridad en Sistemas y Redes Garantía y Seguridad en Sistemas y Redes Tema 3 User Authen0ca0on Esteban Stafford Departamento de Ingeniería Informá2ca y Electrónica Este tema se publica bajo Licencia: Crea2ve Commons BY- NC- SA 40

More information

Spark, Shark and Spark Streaming Introduction

Spark, Shark and Spark Streaming Introduction Spark, Shark and Spark Streaming Introduction Tushar Kale tusharkale@in.ibm.com June 2015 This Talk Introduction to Shark, Spark and Spark Streaming Architecture Deployment Methodology Performance References

More information

Distributed systems for stream processing

Distributed systems for stream processing Distributed systems for stream processing Apache Kafka and Spark Structured Streaming Alena Hall Alena Hall Large-scale data processing Distributed Systems Functional Programming Data Science & Machine

More information

COMPUTER CLASSES July 2018

COMPUTER CLASSES July 2018 NEIGHBORHOOD & REGIONAL Linking YOU to LIBRARIES the World of Technology COMPUTER CLASSES July 2018 For up-to-date class schedules and class descriptions, please visit our website at www.houstonlibrary.org

More information

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015 Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL May 2015 2015, Amazon Web Services, Inc. or its affiliates. All rights reserved. Notices This document

More information

Signal Theory, Telematics and Communications Department. UGR DDS Profile. 22nd July, 2010 Granada

Signal Theory, Telematics and Communications Department. UGR DDS Profile. 22nd July, 2010 Granada UGR DDS Profile 22nd July, 2010 Granada 1 Universidad de Granada DDS research activities 2 Outline UGR people DDS middleware UGR main DDS contributions Results (research projects and publications) Contact

More information

Dell Active Pen. User s Guide PN556W. Regulatory Model: PN556W

Dell Active Pen. User s Guide PN556W. Regulatory Model: PN556W Dell Active Pen PN556W User s Guide Regulatory Model: PN556W Notas, precauciones y avisos NOTA: Una NOTA indica información importante que le ayuda a hacer un mejor uso de su producto. PRECAUCIÓN: Una

More information

Developing Applications with Java Persistence API (JPA)

Developing Applications with Java Persistence API (JPA) Developing Applications with Java Persistence API (JPA) Duración: 2 Días Código del Curso: WD160G Método de Impartición: e-learning (Self-Study) Temario: This 2-day instructor-led course teaches you how

More information

Resumen. Una sola cámara Robot móvil Visión computacional Resultados experimentales

Resumen. Una sola cámara Robot móvil Visión computacional Resultados experimentales Resumen Este artículo describe un método para adquirir la forma 3D de un objeto y estimar la posición de robot por medio de observar un objeto desconocido. Una sola cámara Robot móvil Visión computacional

More information

Diseño y Evaluación de Sistemas Interactivos COM Usability Tests 28 de Octubre de 2010

Diseño y Evaluación de Sistemas Interactivos COM Usability Tests 28 de Octubre de 2010 Diseño y Evaluación de Sistemas Interactivos COM-14112-001 Usability Tests 28 de Octubre de 2010 Dr. Víctor M. González y González victor.gonzalez@itam.mx Agenda 1. Comments on Partial Exam / next week.

More information

How to Route Internet Traffic between A Mobile Application and IoT Device?

How to Route Internet Traffic between A Mobile Application and IoT Device? Whitepaper How to Route Internet Traffic between A Mobile Application and IoT Device? Website: www.mobodexter.com www.paasmer.co 1 Table of Contents 1. Introduction 3 2. Approach: 1 Uses AWS IoT Setup

More information

BIG DATA. Using the Lambda Architecture on a Big Data Platform to Improve Mobile Campaign Management. Author: Sandesh Deshmane

BIG DATA. Using the Lambda Architecture on a Big Data Platform to Improve Mobile Campaign Management. Author: Sandesh Deshmane BIG DATA Using the Lambda Architecture on a Big Data Platform to Improve Mobile Campaign Management Author: Sandesh Deshmane Executive Summary Growing data volumes and real time decision making requirements

More information

CONSOLIDATING RISK MANAGEMENT AND REGULATORY COMPLIANCE APPLICATIONS USING A UNIFIED DATA PLATFORM

CONSOLIDATING RISK MANAGEMENT AND REGULATORY COMPLIANCE APPLICATIONS USING A UNIFIED DATA PLATFORM CONSOLIDATING RISK MANAGEMENT AND REGULATORY COMPLIANCE APPLICATIONS USING A UNIFIED PLATFORM Executive Summary Financial institutions have implemented and continue to implement many disparate applications

More information

COMPUTER CLASSES April 2019

COMPUTER CLASSES April 2019 NEIGHBORHOOD & REGIONAL Linking YOU to LIBRARIES the World of Technology COMPUTER CLASSES April 2019 For up-to-date class schedules and class descriptions, please visit our website at www.houstonlibrary.org

More information

Overview. Prerequisites. Course Outline. Course Outline :: Apache Spark Development::

Overview. Prerequisites. Course Outline. Course Outline :: Apache Spark Development:: Title Duration : Apache Spark Development : 4 days Overview Spark is a fast and general cluster computing system for Big Data. It provides high-level APIs in Scala, Java, Python, and R, and an optimized

More information

Cloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018

Cloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018 Cloud Computing 2 CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University Learning

More information

RISC Processor Simulator (SRC) INEL 4215: Computer Architecture and Organization Feb 14, 2005

RISC Processor Simulator (SRC) INEL 4215: Computer Architecture and Organization Feb 14, 2005 General Project Description RISC Processor Simulator (SRC) INEL 4215: Computer Architecture and Organization Feb 14, 2005 In the textbook, Computer Systems Design and Architecture by Heuring, we have the

More information

Apache Flink. Alessandro Margara

Apache Flink. Alessandro Margara Apache Flink Alessandro Margara alessandro.margara@polimi.it http://home.deib.polimi.it/margara Recap: scenario Big Data Volume and velocity Process large volumes of data possibly produced at high rate

More information

GUÍAS COMPLEMENTARIAS

GUÍAS COMPLEMENTARIAS GUÍAS COMPLEMENTARIAS C A P Í T U L O V I GUÍAS COMPLEMENTARIAS 255 Todos los documentos listados a continuación y emitidos hasta la fecha, así como futuros desarrollos pueden encontrarse para su descarga

More information

Tutorial 3 Q&A. En la pregunta 7 de la sección 2.2 el cual dice: 7. Prove the domination laws in Table 1 by showing that: a)a U = U b)a =

Tutorial 3 Q&A. En la pregunta 7 de la sección 2.2 el cual dice: 7. Prove the domination laws in Table 1 by showing that: a)a U = U b)a = Tutorial 3 Q&A Question 1: 1) Can the range be considered a subset of a function's codomain? No, not always. There are cases that it is like that, but there are many that not. 2) Why is it that if the

More information

Programación remota de robots de ensamblado mediante modelos 3D

Programación remota de robots de ensamblado mediante modelos 3D ESCUELA TÉCNICA SUPERIOR DE INGENIERÍA INDUSTRIAL Grado en Ingeniería en Tecnologías Industriales PROYECTO DE FIN DE GRADO TÍTULO Programación remota de robots de ensamblado mediante modelos 3D 3D MODEL

More information

Propedéutico de Programación

Propedéutico de Programación Propedéutico de Programación Coordinación de Ciencias Computacionales Semana 4, Segunda Parte Dra. Pilar Gómez Gil Versión 1. 24.06.08 http://ccc.inaoep.mx/~pgomez/cursos/programacion/ Chapter 3 ADT Unsorted

More information

Container 2.0. Container: check! But what about persistent data, big data or fast data?!

Container 2.0. Container: check! But what about persistent data, big data or fast data?! @unterstein @joerg_schad @dcos @jaxdevops Container 2.0 Container: check! But what about persistent data, big data or fast data?! 1 Jörg Schad Distributed Systems Engineer @joerg_schad Johannes Unterstein

More information

Important Change to the Year End W2 Process

Important Change to the Year End W2 Process Important Change to the Year End W2 Process This year you will be able to receive your W-2 electronically, download W-2 data to third party tax filing software, and request a copy of your W-2 tax statement.

More information

GENERATING RESTRICTION RULES AUTOMATICALLY WITH AN INFORMATION SYSTEM

GENERATING RESTRICTION RULES AUTOMATICALLY WITH AN INFORMATION SYSTEM GENERATING RESTRICTION RULES AUTOMATICALLY WITH AN INFORMATION SYSTEM M.Sc. Martha Beatriz Boggiano Castillo, Lic. Alaín Pérez Alonso, M.Sc. María Elena Martínez del Busto, Dr. Ramiro Pérez Vázquez, Dra.

More information

Analysis of the image quality transmitted by an IPTV system

Analysis of the image quality transmitted by an IPTV system Analysis of the image quality transmitted by an IPTV system Danilo A. López S #1, Jaime V. Avellaneda #2, Jordi R. Rodríguez #3 # 1 FullTime Professor at Universidad Distrital Francisco José de Caldas,

More information

SRIJAN MANANDHAR MQTT BASED COMMUNICATION IN IOT. Master of Science thesis

SRIJAN MANANDHAR MQTT BASED COMMUNICATION IN IOT. Master of Science thesis SRIJAN MANANDHAR MQTT BASED COMMUNICATION IN IOT Master of Science thesis Examiner: Prof. Kari Systä Examiner and topic approved by the Faculty Council of the Faculty of Department of Pervasive Systems

More information

XSCoRe: A Program Comprehension Workbench

XSCoRe: A Program Comprehension Workbench XSCoRe: A Program Comprehension Workbench Pablo Montes Politécnico Grancolombiano, Ing. de Sistemas Bogotá, Colombia pmontesa@poligran.edu.co Silvia Takahashi Universidad de los Andes, Ing. de Sistemas

More information

Fluentd + MongoDB + Spark = Awesome Sauce

Fluentd + MongoDB + Spark = Awesome Sauce Fluentd + MongoDB + Spark = Awesome Sauce Nishant Sahay, Sr. Architect, Wipro Limited Bhavani Ananth, Tech Manager, Wipro Limited Your company logo here Wipro Open Source Practice: Vision & Mission Vision

More information

Get started. All you need to know to get going. LX370

Get started. All you need to know to get going. LX370 Get started. All you need to know to get going. LX370 Welcome Sprint is committed to developing technologies that give you the ability to get what you want when you want it, faster than ever before. This

More information

Lecture 11 Hadoop & Spark

Lecture 11 Hadoop & Spark Lecture 11 Hadoop & Spark Dr. Wilson Rivera ICOM 6025: High Performance Computing Electrical and Computer Engineering Department University of Puerto Rico Outline Distributed File Systems Hadoop Ecosystem

More information

RHCE-PARTE-12 DHCPv6 SLAAC (StateLess Address AutoConfiguration) 1/5. Objetivo:

RHCE-PARTE-12 DHCPv6 SLAAC (StateLess Address AutoConfiguration) 1/5. Objetivo: Objetivo: RHCE-PARTE-12 DHCPv6 SLAAC (StateLess Address AutoConfiguration) 1/5 Configuración IPv4 server1.example.com y station1.example.com Para: enp0s3 Parámetros de conexión Ipv4: /etc/hosts: 192.168.1.150

More information