D3.8 First Prototype of the Real-time Scheduling Advisor


D3.8 First Prototype of the Real-time Scheduling Advisor

Project Number: 318763
Version: 1.0 Final
Distribution: EC
Lead Partner: Brno University of Technology
Project Partners: aicas, HMI, petafuel, SOFTEAM, Scuola Superiore Sant'Anna, The Open Group, University of Stuttgart, University of York

Every effort has been made to ensure that all statements and information contained herein are accurate; however, the JUNIPER Project Partners accept no liability for any error or omission in the same. Copyright in this document remains vested in the JUNIPER Project Partners.

Project Partner Contact Information

aicas
Fridtjof Siebert
Haid-und-Neue Strasse 18
Karlsruhe, Germany
siebert@aicas.com

HMI
Markus Schneider
Im Breitspiel 11 C
Heidelberg, Germany
schneider@hmi-tec.com

petafuel
Ludwig Adam
Muenchnerstrasse 4
Freising, Germany
ludwig.adam@petafuel.de

SOFTEAM
Andrey Sadovykh
Avenue Victor Hugo
Paris, France
andrey.sadovykh@softeam.fr

Scuola Superiore Sant'Anna
Mauro Marinoni
via Moruzzi 1
Pisa, Italy
m.marinoni@sssup.it

The Open Group
Scott Hansen
Avenue du Parc de Woluwe
1160 Brussels, Belgium
s.hansen@opengroup.org

University of Stuttgart
Bastian Koller
Nobelstrasse 19
Stuttgart, Germany
koller@hlrs.de

University of York
Neil Audsley
Deramore Lane
York YO10 5GH, United Kingdom
neil.audsley@cs.york.ac.uk

Brno University of Technology
Pavel Smrž
Božetěchova
Brno, Czech Republic
smrz@fit.vutbr.cz

Contents

1 Introduction
  1.1 Overview and Context
  1.2 Stream Processing on Heterogeneous Clusters
  1.3 Scheduling Decisions in Stream Processing on Heterogeneous Clusters
  1.4 Related Work
  1.5 Objectives and Structure of this Document
2 Scheduling in the JUNIPER Project
  2.1 Motivation
    2.1.1 Motivating Example
    2.1.2 Scheduling in the Motivating Example
  2.2 Integration Scenario
  2.3 Profiling and Benchmarking
    2.3.1 Profiling
    2.3.2 Benchmarking
3 The Scheduling Advisor
  3.1 Specification
  3.2 Inputs/Outputs and Required/Provided Interfaces
  3.3 Architecture and Prototype Implementation of the Scheduling Advisor
  3.4 Evaluation of the Scheduling Advisor Prototype
4 Conclusion and Future Work
  4.1 Summary and Conclusions
  4.2 Future Work

List of Figures

1 Architecture of the sample JUNIPER application: the graph of JUNIPER programs interconnected by communication channels
2 The non-optimal deployment of the sample JUNIPER application
3 The nearly optimal deployment of the sample JUNIPER application
4 The sequence of the individual steps in the development and deployment of a JUNIPER application, as described in the integration scenario
5 Architecture of the scheduling advisor
6 Scheduler performance comparison: cumulative results of multiple tests based on the number of tuples processed in a time interval

Document Control

Version  Status                                                      Date
0.1      Document outline                                            25 April
         Draft of integration scenario                               30 April
         Updated document outline after internal review by
         project partners                                            13 May
         Making the document ready for internal review               16 May
         First draft for internal review                             19 May
         Final modifications                                         26 May
         QA for EC delivery


Executive summary

This document constitutes deliverable 3.8, First Prototype of the Real-time Scheduling Advisor, of work package 3 of the JUNIPER project. The purpose of this deliverable is to describe the background research, specification, and architecture of the scheduling advisor, its integration into the JUNIPER project, and the implementation of its first prototype.

The scheduling advisor described in this deliverable will provide an architect, modeler, or developer of a JUNIPER application, or a system administrator of a JUNIPER platform, with information suitable for optimization of the application or the platform, respectively. More specifically, the scheduling advisor performs profiling and benchmarking of various deployments of the application on the platform in order to find the best possible deployment. It provides profiling results for the JUNIPER application and benchmarking results for its individual JUNIPER programs across the various deployments of the application (i.e., placements of the programs), measures the longest observed execution time of individual JUNIPER programs in all possible deployments to improve schedulability analysis, and finds the best possible deployment of the JUNIPER application on the particular platform, which indicates the optimal deployment for production use.

Deliverable 3.8, which is described in this document, is the first step towards the final version of the scheduling advisor, which will be described and submitted later in deliverable 3.9, in the third year of the project.

1 Introduction

1.1 Overview and Context

Stream processing [1] is a paradigm evolving in response to the well-known limitations of the widely adopted MapReduce paradigm for distributed Big Data batch processing, which is well recognized by the JUNIPER project in its deliverables. Efficient and real-time processing of large streaming data sources and stored data in the Big Data systems targeted by the JUNIPER project requires the underlying platform to provide applications with:

- high availability and optimal performance, to be able to process the continuously incoming data streams in real time and to fulfill the time constraints imposed by the applications;
- guaranteed resources, to be able to utilize the resources in the processing as necessary and without latencies; and
- high scalability, to be able to adapt to variability in the quality of the incoming data streams, to adapt to different resource availability in heterogeneous clusters, and to provide solutions scalable enough to cater for growing data stream rates in general.

[1] Although stream processing is the processing of continuous data streams, it can also easily process stored input data originally intended for batch processing, by sending the data as a stream of limited length.

Due to these requirements, the underlying platform has to employ advanced resource allocation and scheduling techniques to deploy the applications in an optimal way. Resource allocation deals mainly with the problem of gathering and assigning resources to the different requesters, while scheduling decides which tasks to place on which previously obtained resources, and when. In the JUNIPER project, the main effort in resource allocation and scheduling is concentrated in work package 3, where the system software required by the JUNIPER project is developed. The effort in work package 3 includes software support for micro-scheduling and resource reservation for processes at the operating system level, as well as for high-level scheduling and resource allocation for distributed tasks deployed to individual nodes across a large shared cluster, i.e., in a cloud environment.

In order to support hierarchical scheduling in the cloud settings and environments in which JUNIPER's intended applications operate, this deliverable discusses the design and prototype implementation of a novel real-time scheduling advisor, which is developed in task 3.5 of work package 3. The scheduling advisor will provide developers of JUNIPER applications with the information necessary to optimize resource allocation and scheduling of the applications in cloud environments. This includes information on the performance of various deployments including the worst-case deployment, profiling information on the applications and their individual components, benchmarking information on the deployment platforms running the applications, etc.

This document is part of deliverable 3.8 and describes the development of the first prototype of the scheduling advisor mentioned above. It is a first step towards the implementation of the final version of the scheduling advisor in future deliverable 3.9, to be produced in the third year of the project. The present document specifically focuses on the integration of the scheduling advisor into the JUNIPER project, by describing an integration scenario employing products of other deliverables and by defining the inputs and outputs of the advisor as well as its required and provided interfaces implemented by the other deliverables.

In the general context of the JUNIPER project, this deliverable is highly influenced by other deliverables, produced mainly by work packages 2, 3, and 5 during the first year. More specifically, the scheduling advisor utilizes the concept of the programming model presented in deliverable 2.1 and the representation of this programming model presented in deliverable 5.2, along with the description of schedulability analysis in deliverable 3.1. In the design of the scheduling advisor, we also take into account deliverables 2.2, 2.3, and 3.4, which discuss the optimization of Java applications by FPGAs and introduce high heterogeneity into distributed processing through the utilization of cluster nodes equipped with FPGAs for accelerated computation of particular tasks of a distributed application.

1.2 Stream Processing on Heterogeneous Clusters

In homogeneous computing environments, all nodes have identical performance and capacity. Resources can be allocated evenly across all available nodes, and effective task scheduling is determined by the quantity of the nodes, not by their individual quality. Typically, resource allocation and scheduling in homogeneous computing environments balance the workload across all the nodes, which should carry identical workloads.

Contrary to homogeneous computing environments, a heterogeneous cluster contains different types of nodes with various computing performance and capacity. High-performance nodes can complete the processing of identical data faster than low-performance nodes. Moreover, the performance of the nodes depends on the character of the computation and on the character of the input data. For example, graphics-intensive computations will run faster on nodes that are equipped with powerful GPUs, while memory-intensive computations will run faster on nodes with a large amount of RAM or disk space. To balance the workload in a heterogeneous cluster optimally, a scheduler has to (1) know the performance characteristics of the individual types of nodes employed in the cluster for different types of computations and (2) know, or be able to analyze, the computation characteristics of incoming tasks and input data.

The first requirement, i.e., the performance characteristics of the individual types of employed nodes, means awareness of the infrastructure and topology of a cluster, including a detailed specification of its individual nodes. In most cases, this information is provided at the cluster's design time by its administrators and architects. Moreover, the performance characteristics of the individual nodes employed in a cluster can be adjusted at the cluster's run time, based on historical performance-monitoring data and their statistical analysis over different types of computations and data processed by different types of nodes.

The second requirement is the knowledge, or the ability to analyze, the computation characteristics of incoming tasks and input data. In batch processing, tasks and data in a batch can be annotated or analyzed in advance, i.e., before the batch is executed, and the acquired knowledge can be utilized in optimal allocation of resources and efficient task scheduling. In stream processing, the second requirement is much more difficult to meet due to the continuous flow and unpredictable variability of the input data, which make a thorough analysis of the computation characteristics of the input data and incoming tasks impossible, especially under real-time limitations on their processing.

To address the above-mentioned issues of stream processing in heterogeneous clusters with optimal performance, user-defined tasks processing (at least some of) the input data have to help the scheduler. For example, an application may include user-defined helper tasks tagging input data at run time with their expected computation characteristics for better scheduling. [2] Moreover, individual tasks of a stream application should be tagged at design time according to their required computation resources and the real-time constraints on their processing, to help with their future scheduling. Implementation of this design-time tagging of tasks should be part of the modeling (a meta-model) of the topology and infrastructure of such applications, and it is described in deliverable 5.1 Foundations for MDE of Big Data Oriented Real-Time Systems.

[2] E.g., parts of variable-bit-rate video streams with temporarily high bit rate will be tagged for processing by special nodes with powerful video decoders, while average-bit-rate parts can be processed by common nodes.

With the knowledge of the performance characteristics of the individual types of nodes employed in a cluster, and with the knowledge, or the ability to analyze, the computation characteristics of incoming tasks and input data, a scheduler has enough information for balancing the workload of the cluster nodes and optimizing the throughput of an application. The related scheduling decisions, e.g., re-balancing of the workload, are usually made periodically with an optimal frequency: overly intensive re-balancing of the workload across the nodes can cause high overhead, while occasional re-balancing may not utilize all nodes optimally.

1.3 Scheduling Decisions in Stream Processing on Heterogeneous Clusters

Schedulers make their decisions on a particular level of abstraction. They do not try to schedule all live tasks to all possible cluster nodes, but deal with units of equal or scalable size. For example, the YARN scheduler [14] uses Containers with various amounts of cores and memory, and Apache Storm [2] uses Slots of equal size (one slot per CPU core) where, in each slot, multiple spouts or bolts of the same topology may run.

One of the important and commonly adopted scheduling decisions is data locality. For instance, the main idea of MapReduce is to perform computations on the nodes where the required data are stored, to prevent intensive loading of data into and removal from a cluster. Data-locality decisions from the stream processing perspective are different, because processors usually do not operate on data already stored in a processing cluster, but rather on streams coming from remote sources. Thus, in stream processing, we consider data locality to be an approach to minimizing communication costs, which results, for example, in scheduling the most-communicating processor instances together on the same node or the same rack.

The optimal placement of tasks across cluster nodes may, moreover, depend on requirements beyond the communication costs mentioned above. Typically, these concern CPU performance or overall node performance that makes the processing faster. For example, the performance optimization may lie in the detection of tasks which are exceedingly slow in comparison to others with the same signature. More sophisticated approaches are based on various kinds of benchmarks performed on each node in a cluster, where the placement of a task is decided with respect to its detected performance on a particular node or class of nodes. Furthermore, the presence of resources, e.g., GPUs or FPGAs, can be taken into account. A simplified sketch combining these placement considerations follows.
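To make the two requirements above concrete, a scheduler could combine a per-node-type performance profile (requirement 1) with measured inter-task traffic (the data-locality consideration) when scoring a candidate placement. The following Java sketch is a simplified illustration; the class and all names in it are assumptions made for this example, not an actual JUNIPER component:

    import java.util.*;

    /** Illustrative placement scoring: node-type performance plus communication cost. */
    public class PlacementScore {
        // Requirement (1): relative speed of each node type for each computation type,
        // provided at cluster design time or adjusted from monitoring history.
        private final Map<String, Map<String, Double>> speed = new HashMap<>();
        // Measured traffic (e.g., tuples per second) between pairs of tasks.
        private final Map<String, Map<String, Double>> traffic = new HashMap<>();

        public void setSpeed(String nodeType, String computationType, double s) {
            speed.computeIfAbsent(nodeType, n -> new HashMap<>()).put(computationType, s);
        }

        public void setTraffic(String taskA, String taskB, double rate) {
            traffic.computeIfAbsent(taskA, t -> new HashMap<>()).put(taskB, rate);
        }

        /**
         * Score an assignment of tasks (with known computation types) to node types:
         * higher node speed raises the score, while traffic between tasks placed on
         * different nodes lowers it (the data-locality consideration above).
         */
        public double score(Map<String, String> taskToNode, Map<String, String> taskToType) {
            double s = 0;
            for (Map.Entry<String, String> e : taskToNode.entrySet()) {
                String type = taskToType.get(e.getKey());
                s += speed.getOrDefault(e.getValue(), Collections.emptyMap())
                          .getOrDefault(type, 1.0);
            }
            for (Map.Entry<String, Map<String, Double>> e : traffic.entrySet())
                for (Map.Entry<String, Double> t : e.getValue().entrySet())
                    if (!Objects.equals(taskToNode.get(e.getKey()), taskToNode.get(t.getKey())))
                        s -= t.getValue(); // penalize remote communication
            return s;
        }
    }

A scheduler could evaluate such a score for each candidate placement and pick the highest-scoring one; real schedulers weigh the two terms and add further constraints.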

There are two essential kinds of scheduling decisions: off-line decisions and on-line decisions. The former are based on the knowledge the scheduler has before any task is placed and running. In the context of stream processing, this knowledge is mostly the topology, and off-line decisions can, for example, consider the communication channels between nodes. On-line decisions are made with information gathered during the actual execution of an application, i.e., after or during the initial placement of its tasks over a cluster's nodes. So the counterpart of the off-line, topology-based communication decision is a decision derived from the real bandwidths required between running processor instances [2]. In effect, most scheduling decisions in stream processing are made on-line, or are based on historical on-line data.

1.4 Related Work

Over the past decade, stream processing has been the subject of much research. Existing approaches can essentially be categorized by scalability into centralized, distributed, and massively-parallel stream processors. In this section, we will focus mainly on distributed and massively-parallel stream processors, but also on their successors exploiting ideas of the MapReduce paradigm in the context of stream processing.

For distributed stream processors, related work is mainly based on Aurora* [6], which has been introduced for scalable distributed processing of data streams. An Aurora* system is a set of Aurora* nodes that cooperate via an overlay network within the same administrative domain. The nodes can freely relocate load by decentralized, pairwise exchange of the Aurora* stream operators. Sites running Aurora* systems from different administrative domains can be integrated into a single federated system by Medusa [6]. Borealis [1] has introduced a refined QoS optimization model for Aurora*/Medusa where the effects of load shedding on QoS can be computed at every point in the data flow, which enables better strategies for load shedding.

Massively-parallel data processing systems, in contrast to the distributed (and also centralized) stream processors, have been designed to run on hundreds or even thousands of nodes and to efficiently transfer large data volumes between them. Traditionally, those systems have been used to process finite blocks of data stored on distributed file systems. However, newer systems such as Dryad [8], Hyracks [4], CIEL [12], DAGuE [5], or the Nephele framework [15] make it possible to assemble complex parallel data flow graphs and to construct pipelines between individual parts of the flow. Therefore, these parallel data flow systems in general are also suitable for streaming applications.

The latest related work is based mainly on the MapReduce paradigm or its concepts in the context of stream processing. First, Hadoop Online [7] extended the original Hadoop with the ability to stream intermediate results from map to reduce tasks, as well as the possibility to pipeline data across different MapReduce jobs. To facilitate these new features, the semantics of the classic reduce function has been extended with time-based sliding windows. Li et al. [10] picked up this idea and further improved the suitability of Hadoop-based systems for continuous streams by replacing the sort-merge implementation for partitioning with a new hash-based technique. The Muppet system [9], on the other hand, replaced the reduce function of MapReduce with a more generic and flexible update function.

S4 [13] and Apache Storm [11], which is used in this document, can also be classified as massively-parallel data processing systems with a clear emphasis on low latency. They are not based on MapReduce, but allow developers to assemble an arbitrarily complex directed acyclic graph (DAG) of processing tasks.

For example, Storm does not use intermediate queues to pass data items between tasks. Instead, data items are passed directly between the tasks, using batch messages on the network level to achieve a good balance between latency and throughput.

The distributed and massively-parallel stream processors mentioned above usually do not explicitly solve adaptive resource allocation and task scheduling in heterogeneous environments. For example, in [3], the authors demonstrate how Aurora*/Medusa handles time-varying load spikes and provides high availability in the face of network partitions. They concluded that Medusa with the Borealis extension does not distribute load optimally, but it guarantees acceptable allocations; i.e., either no participant operates above its capacity, or, if the system as a whole is overloaded, then all participants operate at or above capacity. Similar conclusions can be drawn for the previously mentioned massively-parallel data processing systems. For example, DAGuE does not target heterogeneous clusters which utilize commodity hardware nodes, but it can handle intra-node heterogeneity in clusters of supercomputers, where a runtime DAGuE scheduler decides at runtime which tasks to run on which resources [5]. Another already mentioned massively-parallel stream processing system, Dryad [8], is equipped with a robust scheduler which takes care of node liveness and the rescheduling of failed jobs, and which tracks the execution speed of different instances of each processor. When one of these instances under-performs the others, a new instance is scheduled in order to prevent slowdowns of the computation. The Dryad scheduler works in a greedy mode; it does not consider sharing of the cluster among multiple systems.

Finally, in the case of approaches based on the MapReduce paradigm or its concepts, resource allocation and scheduling of stream processing on heterogeneous clusters is necessary due to the utilization of commodity hardware nodes. In stream processing, data placement and distribution are given by a user-defined topology (e.g., by pipelines in Hadoop Online [7], or by a DAG of interconnected spouts and bolts in Apache Storm [11]). Therefore, approaches to adaptive resource allocation and scheduling have to address the initial distribution and the periodic re-balancing of the workload (i.e., tasks, not data) across nodes according to the different processing performance and specialization of the individual nodes in a heterogeneous cluster. For instance, S4 [13] uses Apache ZooKeeper to coordinate all operations and for communication between nodes. Initially, a user defines in ZooKeeper which nodes should be used for particular tasks of a computation. Then, S4 employs some nodes as backups for possible node failures and for load balancing. Adaptive scheduling in Apache Storm has been addressed in [2] by two generic schedulers that adapt their behavior according to the topology and the run-time communication pattern of an application. Experiments showed an improvement in the latency of event processing in comparison with the default Storm scheduler; however, the schedulers do not take into account the requirements discussed at the beginning of Section 1.2, i.e., the explicit knowledge of the performance characteristics of the individual types of nodes employed in a cluster for different types of computations, and the ability to analyze the computation characteristics of incoming tasks and input data. Implementing these requirements can further improve the efficiency of scheduling.

1.5 Objectives and Structure of this Document

The objectives of this document are based on the Description of Work for task 3.5 of the JUNIPER project and can be summarized as follows:

1. To analyze hierarchical scheduling approaches in cloud settings, dealing with the virtualized distributed systems that dominate the cloud environments in which JUNIPER's intended applications operate.
2. To extend the current scheduling approaches to address JUNIPER's focus on real-time in Big Data application contexts.
3. To design and implement a novel real-time scheduling advisor for JUNIPER applications.

Section 1 addresses the first objective and introduces the state of the art of scheduling of stream processing on heterogeneous clusters. Such processing is characteristic of the JUNIPER project, due to both the processing of Big Data streams and the heterogeneity of distributed processing clusters utilizing commodity hardware nodes as well as specialized nodes with FPGAs. Section 2 addresses the second objective and deals with the particular needs of the JUNIPER project in the scheduling of stream processing on heterogeneous clusters, which should be addressed by the scheduling advisor. Finally, Section 3 describes the first prototype of the scheduling advisor, its specification, architecture, design and implementation, and possible evaluation, to address the third objective. The document is concluded by Section 4, which sums up the contributions of deliverable 3.8 described in this document and outlines ongoing and future work towards the next deliverable 3.9, which will provide the final version of the scheduling advisor.

2 Scheduling in the JUNIPER Project

2.1 Motivation

According to deliverable 2.1, each JUNIPER application is composed of individual JUNIPER programs performing particular tasks. These JUNIPER programs are interconnected with the outside environment via input/output and request/response channels or streams, and with each other via communication channels. The responsibility and goal of a scheduler in the JUNIPER platform is to deploy the individual JUNIPER programs, or the JUNIPER application in general, in the optimal way, i.e., so that data processing by the JUNIPER application is the fastest (the JUNIPER application has the maximal performance).

Currently, there is no known scheduler for a distributed stream processing platform (see Section 1.4) which implements the desired features, such as support for the heterogeneity of a deployment platform/cluster, or the integration of profiling and benchmarking information into scheduling decisions. The scheduling advisor, which is described in this document, should improve scheduling decisions in the scheduler and provide useful advice to developers of JUNIPER applications, enabling them to optimize the application's performance and the efficient utilization of the resources required by the application.

2.1.1 Motivating Example

To demonstrate the scheduling problems anticipated in the current state of the art of stream processing on heterogeneous clusters, i.e., the problems which can be met in the JUNIPER project, a sample application is presented in this section. The application Popular Stories implements a use case of processing a continuous stream of web pages from thousands of RSS feeds. It analyzes the web pages in order to find texts and photos identifying the most popular connections between persons and related keywords and pictures. The result is a list of triples (a person's name, a list of keywords, and a set of photos), meaning that the person has recently been frequently mentioned in the context of the keywords (e.g., events, objects, persons, etc.) and the photos. The application holds a list of triples with the most often seen persons, keywords, and pictures for some period of time. This way, current trends of persons related to keywords, with relevant photos, can be obtained. [3]

[3] The application has been developed for demonstration purposes only, particularly to evaluate the scheduling advisor developed for the JUNIPER project and described in this document. However, the application may also be used in practice, e.g., for visualization of popular news on persons (with related keywords and photos) from news feeds on the Internet.

As a JUNIPER application, the sample application is composed of components which are JUNIPER programs, as depicted in Figure 1. The stream processing performed by the application starts with the URL generator program, which extracts URLs of web pages from RSS feeds arriving on an input channel. After that, the Downloader program gets the (X)HTML source, styles, and pictures of each web page and encapsulates them into a stand-alone message. The message is passed via a communication channel to the Analyzer program, which searches the web page for person names and for keywords and pictures in the context of the names found. The resulting name-keyword pairs are sent via a communication channel and stored in tops lists in the In-memory store NK program,

where the store is updated each time a new pair arrives and where excessively old pairs are removed from the computation of the list. In other words, a sliding time window is held for the tops list. All changes in the tops list are passed via a communication channel to the In-memory store NKP program. Moreover, name-picture pairs emitted by the Analyzer program are sent via a communication channel and processed by the Image feature extractor program to get indexable features of each image, which later allow the detection of different instances of the same picture (e.g., the same photo in a different resolution or with different cropping). Then, the image features are sent via a communication channel to the In-memory store NP program, where the tops list of the most popular persons and related unique images is held. Finally, all modifications in the tops list of the In-memory store NP program are emitted via a communication channel to the In-memory store NKP program, which maintains a consolidated tops list of persons with related keywords and pictures. This tops list is persistent, i.e., it is saved in the Persistent storage, and it is available for further querying via the input/output channels of the JUNIPER application.

Figure 1: Architecture of the sample JUNIPER application: the graph of JUNIPER programs interconnected by communication channels
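To illustrate the windowed tops-list maintenance performed by the In-memory store programs, the following minimal Java sketch keeps counts of name-keyword pairs within a sliding time window; the class and method names are illustrative and are not part of the JUNIPER API:

    import java.util.*;

    /**
     * Minimal sketch of a windowed "tops list" such as the one kept by the
     * In-memory store NK program: counts of name-keyword pairs are maintained
     * over a sliding time window, and old observations expire.
     */
    public class WindowedTopsList {
        private static final class Arrival {
            final long timestamp; final String key;
            Arrival(long timestamp, String key) { this.timestamp = timestamp; this.key = key; }
        }

        private final long windowMillis;
        private final Deque<Arrival> arrivals = new ArrayDeque<>();
        private final Map<String, Integer> counts = new HashMap<>();

        public WindowedTopsList(long windowMillis) { this.windowMillis = windowMillis; }

        /** Record one observed (name, keyword) pair and expire pairs older than the window. */
        public void add(String name, String keyword, long now) {
            String key = name + "|" + keyword;
            arrivals.addLast(new Arrival(now, key));
            counts.merge(key, 1, Integer::sum);
            while (!arrivals.isEmpty() && now - arrivals.peekFirst().timestamp > windowMillis) {
                Arrival old = arrivals.removeFirst();
                counts.computeIfPresent(old.key, (k, c) -> c == 1 ? null : c - 1);
            }
        }

        /** Return the k most frequent pairs currently inside the window. */
        public List<String> top(int k) {
            List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
            entries.sort((a, b) -> b.getValue() - a.getValue());
            List<String> result = new ArrayList<>();
            for (int i = 0; i < Math.min(k, entries.size()); i++) result.add(entries.get(i).getKey());
            return result;
        }
    }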

The sample application described above is designed as a JUNIPER application; however, it can also be implemented in other frameworks for distributed stream processing, e.g., as an Apache Storm application. Nevertheless, the application utilizes technologies of the Hadoop platform; e.g., the in-memory stores and the final persistent storage in the application employ the search engine Apache Lucene with the distributed Hadoop-based storage Katta for Lucene indexes, to detect different instances of the same pictures as mentioned above. For testing purposes, for an initial evaluation of different scheduling approaches, and finally also for the evaluation of the scheduling advisor described in this document, the sample application has initially been implemented in Apache Storm; a sketch of such an implementation follows. Results of the evaluation are presented later in this document.
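For concreteness, a Storm implementation of the topology in Figure 1 could be assembled with Storm's Java API roughly as follows; the spout/bolt class names, stream identifiers, and parallelism hints are assumptions made for this sketch and do not reflect the actual prototype code:

    import backtype.storm.Config;
    import backtype.storm.StormSubmitter;
    import backtype.storm.topology.TopologyBuilder;
    import backtype.storm.tuple.Fields;

    // Illustrative assembly of the Popular Stories topology; all spout/bolt
    // classes and stream ids are assumed for this sketch.
    public class PopularStoriesTopology {
        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            // RSS feeds enter through the URL generator spout.
            builder.setSpout("url-generator", new UrlGeneratorSpout(), 1);
            builder.setBolt("downloader", new DownloaderBolt(), 4)
                   .shuffleGrouping("url-generator");
            builder.setBolt("analyzer", new AnalyzerBolt(), 4)
                   .shuffleGrouping("downloader");
            // Name-keyword pairs go to the NK store, grouped by name so that one
            // store instance owns all counts for a given person.
            builder.setBolt("store-nk", new InMemoryStoreNkBolt(), 2)
                   .fieldsGrouping("analyzer", "name-keyword", new Fields("name"));
            // Name-picture pairs pass through the image feature extractor to the NP store.
            builder.setBolt("image-features", new ImageFeatureExtractorBolt(), 2)
                   .shuffleGrouping("analyzer", "name-picture");
            builder.setBolt("store-np", new InMemoryStoreNpBolt(), 2)
                   .fieldsGrouping("image-features", new Fields("name"));
            // The NKP store consolidates changes coming from both stores.
            builder.setBolt("store-nkp", new InMemoryStoreNkpBolt(), 1)
                   .globalGrouping("store-nk")
                   .globalGrouping("store-np");

            Config conf = new Config();
            conf.setNumWorkers(4);
            StormSubmitter.submitTopology("popular-stories", conf, builder.createTopology());
        }
    }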

2.1.2 Scheduling in the Motivating Example

When a JUNIPER application is deployed on a deployment platform, the individual JUNIPER programs of the application need to be instantiated, and their instances need to be executed on individual nodes of the target platform. This is the task of the scheduler of the target platform, which controls the instantiation of the JUNIPER programs and the placement of their instances on the available platform nodes, utilizing resources previously allocated to the JUNIPER application. In the case of the deployment of a JUNIPER application on a target platform that is a heterogeneous cluster, scheduling is not trivial and may result in non-optimal JUNIPER program instantiations and subsequent placement.

Let us have a heterogeneous cluster that consists of four nodes with different hardware configurations; more specifically, nodes with a fast CPU, with a slow CPU, with lots of memory, and equipped with GPUs. Figure 2 depicts a non-optimal scheduling of the sample application described in Section 2.1.1, caused by the lack of knowledge about the deployed application and the deployment target platform. After this scheduling, the Analyzer programs, which require CPU performance, were placed on the node with lots of memory, while the memory-greedy In-memory store programs were scheduled on the nodes with a powerful GPU and a slow CPU, which led to the need for a higher level of parallelism of MS NP. Moreover, the fast CPU node runs the undemanding Downloaders and the URL generator program. Finally, the Image feature extractor programs were placed on the slow CPU node and the high-memory node. Therefore, it is obvious that the scheduling decisions taken were wrong and that the scheduling resulted in inefficient utilization of the cluster.

Contrary to the scheduling described above, the scheduling of the sample JUNIPER application depicted in Figure 3 is nearly optimal. The In-memory store programs were deployed on the node with a high amount of memory, and the Image feature extractor programs were deployed on the node with two GPUs, so it was possible to reduce parallelism. The undemanding Downloader programs were placed on the slow CPU node, and the Analyzers utilize the fast CPU node.

(Dx: Downloader; Ax: Analyzer; IEx: Image feature extractor; MSx: In-memory store)
Figure 2: The non-optimal deployment of the sample JUNIPER application

2.2 Integration Scenario

The development and deployment of a JUNIPER application requires the cooperation of participants in different roles in the development process. Moreover, in the JUNIPER project, there are several specialized tools, namely the scheduling advisor and the schedulability analysis tools, which help the involved participants with the development and deployment. To describe the cooperation of the roles in the development process and the possible utilization of the mentioned tools, we define the following well-established roles:

- an analyst, who describes the required functionality of a JUNIPER application,
- an architect, who designs the application architecture according to the JUNIPER platform,
- a modeler, who describes the architecture and designs the application in detail according to the JUNIPER programming model,
- a developer, who implements the application utilizing the JUNIPER application interface (API),
- a system administrator, who deploys the application to a particular cluster running the JUNIPER platform.

The development process of a JUNIPER application, and the utilization of the JUNIPER tools in the development and deployment of the application by the above-mentioned roles, is depicted in Figure 4. The individual steps of the development process are described in the following integration scenario:

1. An analyst describes the specification of a JUNIPER application.

(Dx: Downloader; Ax: Analyzer; IEx: Image feature extractor; MSx: In-memory store)
Figure 3: The nearly optimal deployment of the sample JUNIPER application

2. An architect/modeler creates a bare model of the JUNIPER application (which is composed of JUNIPER programs). The bare model describes:
   - the logical architecture of the JUNIPER application (a graph of JUNIPER programs interconnected by communication channels and connected with the outside world by I/O channels and request/response streams, as described in deliverable 2.1),
   - the data structures (data types) for the communication channels, I/O channels, and request/response streams utilized by the JUNIPER application,
   - the expected data rates (frequencies) for processing data coming from the communication channels, I/O channels, and request/response streams utilized by the JUNIPER application,
   - the real-time constraints of the (real-time-)critical parts of the JUNIPER application, i.e., of subsets of all JUNIPER programs in the JUNIPER application.

3. A developer implements the JUNIPER application, i.e., he/she implements and integrates the individual JUNIPER programs of the JUNIPER application.

4. A system administrator defines a platform (a cluster, e.g., BonFIRE) where the JUNIPER application will eventually be deployed, including information about the (possible) allocation of resources within the platform and about the scheduling algorithms used within the single (cluster) nodes.

[Figure 4 (workflow diagram). Developer: Model application, Implement application, Run schedulability analysis, Define degree of parallelism. Administrator: Define platform. Scheduling advisor: Optimize deployment, Profile application, Deploy application. Numbers in black boxes refer to the numbered parts of the description.]
Figure 4: The sequence of the individual steps in the development and deployment of a JUNIPER application, as described in the integration scenario in this section

5. The developer, in cooperation with the system administrator, determines the longest observed execution time for each JUNIPER program of the JUNIPER application running on the defined platform, i.e., the execution time in the case of placement on the slowest platform node, when processing the most difficult input data, performing the slowest I/O operations, the most intensive resource allocations, etc.

6. The architect/modeler/developer uses the schedulability analysis tools to get information on the possible schedulability of the JUNIPER application. The schedulability analysis tools will process the following inputs:
   - the bare model, including the real-time constraints of the (real-time-)critical parts of the JUNIPER application, specified by the modeler in the second step of this scenario,
   - the information about the (possible) allocation of resources within the platform and about the scheduling algorithms used within the single (cluster) nodes, provided by the system administrator in the fourth step of this scenario,
   - the longest observed execution time of each JUNIPER program of the JUNIPER application, provided by the developer in cooperation with the system administrator in the fifth step of this scenario, or by the scheduling advisor, which can perform profiling of each

single JUNIPER program running alone (without any concurrency), as described later in step 13 of this scenario.

7. The schedulability analysis tools compute, from the inputs described in the previous step, the following results/outputs:
   - an upper bound on the response time of the overall stream processing in the cases when the real-time constraints specified in the inputs of the schedulability analysis are met, i.e., the response time of the JUNIPER application processing a particular piece of data from its input to its output,
   - an upper bound on the response time of each individual JUNIPER program of the JUNIPER application in the cases when the real-time constraints specified in the inputs of the schedulability analysis are met, i.e., the response time of each JUNIPER program processing a particular piece of data from its input to its output,
   - a sensitivity analysis, i.e., information on how to modify the design of the JUNIPER application and the allocation/scheduling phase of the JUNIPER programs to improve the real-time performance and response time.

8. The architect/modeler/developer takes into account the results of the schedulability analysis and performs the following activities to finalize the development of the JUNIPER application:
   - he/she concludes that the JUNIPER application should go into production, i.e., can successfully run on the defined platform, for the following reasons: (a) the upper bound on the response time [6] of the JUNIPER application meets its specification and its model defined at the beginning of this scenario, i.e., the JUNIPER application is able to meet the defined real-time constraints in each possible deployment, and (b) the JUNIPER application and its programs can be deployed in such a way/with such resource utilization that reduces the upper bound on the response times (better than the worst possible deployment; e.g., computation-intensive JUNIPER programs will utilize the nodes with the most powerful CPUs),
   - he/she re-designs the JUNIPER application according to the results of the sensitivity analysis from the previous step of this scenario, e.g., to reduce the response times or to improve the suitability of the JUNIPER application for eventual deployment on the defined platform,
   - he/she asks the system administrator to re-define the platform, e.g., to provide more powerful nodes and thereby eventually improve the longest observed execution times and also the response times of the JUNIPER programs and the JUNIPER application.

9. The developer defines a degree of parallelization for each JUNIPER program of the JUNIPER application as the number of instances which should be deployed for the program, based on the results of the schedulability analysis and the expected data rates (frequencies) of incoming data in the bare model; more instances of a JUNIPER program mean higher throughput in data processing, even in the case of a high upper bound on the response time of the JUNIPER program.

[6] The upper bound on the response time = finishing time − release time; more abstractly, also an upper bound on the budget used (i.e., the maximum amount of time allocated to the budget used; for budgets, see deliverable 3.1, which deals with scheduling and resource reservation techniques).
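Restated in symbols (the symbols $f$ for finishing time, $r$ for release time, and the index $k$ over observed executions are introduced here for illustration only, they are not notation from deliverable 3.1):

    \[ R = f - r, \qquad \overline{R} = \max_{k} \, (f_k - r_k) \]

i.e., the response time of one execution is the difference between its finishing and release times, and the reported upper bound is the maximum of this difference over all observed executions.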

10. The system administrator uses the scheduling advisor to deploy the JUNIPER application on the defined platform. For the deployment, the scheduling advisor takes into account the following inputs:
   - the bare model of the JUNIPER application (how to deploy), specified by the modeler in the second step of this scenario,
   - the implementation of the JUNIPER application (what to deploy), provided by the developer in the third step of this scenario,
   - the defined platform for the JUNIPER application (where to deploy), provided by the system administrator in the fourth step of this scenario,
   - the number of instances of the individual JUNIPER programs of the JUNIPER application, defined by the developer in the previous step of this scenario.

11. The scheduling advisor performs the deployment of the JUNIPER application in accordance with the inputs described in the previous step and with the following rule: instances of the JUNIPER programs are assigned to, and executed on, particular nodes primarily at random, but in such a way that instances of the same JUNIPER program are assigned to as many different types of nodes as possible (the goal is to profile/benchmark various deployments/combinations of the JUNIPER programs and the platform nodes; a simplified sketch of this rule follows step 12).

12. After the deployment of the JUNIPER application, the scheduling advisor will:
   - observe the performance of the individual JUNIPER programs and check whether all defined real-time constraints of the (real-time-)critical parts of the JUNIPER application are met,
   - concurrently perform both profiling of the JUNIPER programs and benchmarking of the platform nodes running the JUNIPER programs, to obtain information on the performance of particular JUNIPER programs on particular platform nodes,
   - perform redeployment of the JUNIPER application with various assignments of the JUNIPER programs to particular platform nodes, to get new profile/benchmark data on yet unobserved deployments; this redeployment can be done if and only if (a) the JUNIPER application can be re-executed from time to time (i.e., terminated and executed in the new deployment), which should be defined by the analyst in the first step of this scenario, and (b) redeployment is not disabled by the system administrator, e.g., in the case of a production run of the JUNIPER application where high availability is required,
   - start to optimize the deployment, i.e., to deploy the JUNIPER application in the best known way, as soon as (a) there are no new data obtainable by further profiling/benchmarking (e.g., all possible program-to-node deployments have been profiled/benchmarked), or (b) there is an immediate need for the most optimal deployment according to the current profiling and benchmarking data, e.g., in the case the JUNIPER application goes into production.
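A minimal sketch of the placement rule from step 11, with all names assumed for illustration: instances are placed randomly, but node types that have so far hosted fewer instances of the given program are preferred, so that profiling and benchmarking cover as many program-to-node-type combinations as possible.

    import java.util.*;

    /** Illustrative placement following step 11: random, but biased towards unseen node types. */
    public class CoveragePlacement {
        private final Random random = new Random();
        // How many instances of each program have been observed on each node type so far.
        private final Map<String, Map<String, Integer>> seen = new HashMap<>();

        /** Pick a node type for the next instance of the given program. */
        public String place(String program, List<String> nodeTypes) {
            Map<String, Integer> counts = seen.computeIfAbsent(program, p -> new HashMap<>());
            // Find the minimum observation count over the available node types.
            int min = Integer.MAX_VALUE;
            for (String type : nodeTypes) min = Math.min(min, counts.getOrDefault(type, 0));
            // Choose uniformly at random among the least-observed node types.
            List<String> candidates = new ArrayList<>();
            for (String type : nodeTypes)
                if (counts.getOrDefault(type, 0) == min) candidates.add(type);
            String chosen = candidates.get(random.nextInt(candidates.size()));
            counts.merge(chosen, 1, Integer::sum);
            return chosen;
        }
    }

The randomness keeps the placement unbiased, while the minimum-count preference steers successive (re)deployments towards combinations that have not yet been observed.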

13. The scheduling advisor will provide the data resulting from the profiling/benchmarking to the architect/modeler/developer/system administrator, who can utilize the outputs of the scheduling advisor together with the outputs of the schedulability analysis, e.g.,
   - to analyze the profiling and benchmarking results and specify more accurately the longest observed execution times of the JUNIPER application and its programs for the defined platform, and to use these times in the schedulability analysis tools in the sixth step of this scenario,
   - to redesign the JUNIPER application to be able to utilize the most powerful platform nodes according to the profiling and benchmarking results,
   - to optimize the deployment of the JUNIPER application on the defined platform by analysis of the profiling and benchmarking results (e.g., for classification of the platform nodes according to their suitability for individual JUNIPER programs), especially to select the optimal production deployment where any future experiments in re-deployment will not be possible,
   - to optimize the defined platform for the JUNIPER application, e.g., to increase the performance of the critical platform nodes according to the profiling and benchmarking results.

2.3 Profiling and Benchmarking

The proposed scheduling advisor targets off-line decisions derived from performance measurements, namely from the results of profiling a JUNIPER application in particular deployments on a defined JUNIPER platform, and from the results of benchmarking the JUNIPER platform with the resources available to the particular JUNIPER application. These two types of performance measurements eventually allow the scheduling advisor to find the optimal deployment of the JUNIPER application, as described in Section 2.2. Although both the profiling and the benchmarking are carried out for the same goal, and can be executed concurrently while running a JUNIPER application, they have to be performed by different means.

2.3.1 Profiling

The profiling of a JUNIPER application is the measurement of its behavior at various levels, i.e., the behavior of the whole JUNIPER application as well as the behavior of the individual JUNIPER programs, together with the measurement of resource utilization by the JUNIPER application, which is important for understanding the behavior and for evaluating how well the JUNIPER application performs on a deployment platform. Therefore, the profiling of a JUNIPER application requires the ability to observe and to measure the behavior and resource utilization of the application and its programs (the application components). The profiling results in a statistical summary of the observed events and the utilized resources, i.e., a profile, and in a sequence of recorded events, i.e., a trace (e.g., events which represent the resource reservation requests).

There are two approaches to observing and measuring the behavior and resource utilization of JUNIPER applications and JUNIPER programs:

1. non-intrusive profiling, which is implemented by the platform running the application, i.e., outside of the profiled application or program, which need not be modified for the profiling;
2. intrusive profiling, which is implemented in the profiled application or its programs by its developer (the application or the programs need to be modified for this type of profiling), without participation of the deployment platform.
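As a concrete illustration of the second (intrusive) approach, a developer might wrap a program's processing step with timing instructions. The following Java sketch uses only standard library calls; the class and the analyze step are hypothetical stand-ins for real JUNIPER program logic.

    /** Illustrative intrusive profiling: the developer adds timing code to a program's hot path. */
    public class InstrumentedAnalyzer {
        private long invocations = 0;
        private long totalNanos = 0;
        private long maxNanos = 0; // longest observed execution time of this step

        /** The instrumented processing step; analyze(page) stands for the real work. */
        public void process(String page) {
            long start = System.nanoTime();
            analyze(page);                      // the actual program logic
            long elapsed = System.nanoTime() - start;
            invocations++;
            totalNanos += elapsed;
            maxNanos = Math.max(maxNanos, elapsed);
        }

        private void analyze(String page) { /* application-specific work */ }

        /** Emit the collected profile, e.g., for the scheduling advisor to pick up. */
        public String profile() {
            return String.format("calls=%d avg=%dns max=%dns",
                    invocations, invocations == 0 ? 0 : totalNanos / invocations, maxNanos);
        }
    }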

The specific implementation of the above-mentioned profiling approaches in the JUNIPER project is determined by the implementation technology of JUNIPER applications, which are developed in Java, and by the implementation technology of the JUNIPER platform, which utilizes the Message Passing Interface (MPI), or more specifically, the Open MPI library (see deliverable 4.1 for both the Java and the Open MPI utilization). In the case of the non-intrusive profiling, Java provides the JVM Tool Interface (JVMTI) API, which enables a profiler to use hooks for trapping events like method calls, class loading and unloading, events generated by JVM garbage collection, events monitoring application threads, etc. These events will be generated by the JVMs running the individual JUNIPER programs on the platform nodes. Moreover, the Open MPI library can be modified to generate events on MPI operations called by the JUNIPER programs, transparently, without the programs being aware of it.

Contrary to the non-intrusive profiling, the intrusive profiling requires particular modifications of the profiled JUNIPER application and JUNIPER programs. A developer of the profiled application or program performs instrumentation, i.e., he/she adds to the application or program specific instructions which collect the required information. In comparison with the non-intrusive profiling, the intrusive profiling by instrumentation is done by the developer, not by the platform, and it has several disadvantages: it requires a special build of the profiled application or program with the instrumentation enabled; it can cause performance or even behavioral changes in the profiled application or program, due to the impact and overhead of the profiling instructions (i.e., of the instrumentation); it cannot be easily controlled by the deployment platform (e.g., to focus the profiling on particular parts of the application or program); etc.

2.3.2 Benchmarking

Benchmarking is the process of measuring the performance of the JUNIPER platform, i.e., of a deployment environment for JUNIPER applications. In the JUNIPER project, the benchmarking will be performed by the scheduling advisor, which is introduced in this document, by means of a JUNIPER application deployed on the benchmarked platform. The main goal of such benchmarking is to decide which platform node is the most suitable for a particular JUNIPER program of the JUNIPER application. The most suitable node is the node where the particular JUNIPER program finishes in the shortest time, in comparison with the duration of the same JUNIPER program performed by other platform nodes.

Besides the main goal mentioned above, the benchmarking in the JUNIPER project will discover performance characteristics of the individual nodes of the benchmarked platform, such as their processing speed, the performance of incoming and outgoing communication channels (latency, bandwidth, throughput, etc.), the cost per unit of computation (e.g., based on power consumption), etc. These performance characteristics do not directly determine the best deployment found before (e.g., in the case of a JUNIPER program which can be accelerated by a particular technology, such as a static-acceleration FPGA available on a particular node only, the best placement is driven by the accelerator, not by the computational performance of the node).
However, the discovered performance characteristics can be useful for tuning the JUNIPER application for the benchmarked platform by a developer of the application, or for tuning the platform itself by its system administrator (see the last step of the integration scenario in Section 2.2).
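A minimal sketch of the bookkeeping such benchmarking implies: for every (program, node) combination observed across the profiled deployments, the advisor records execution times and can then report the most suitable node per program. All names are illustrative assumptions, not the advisor's actual interface.

    import java.util.*;

    /** Illustrative benchmark record: execution times per (program, node) combination. */
    public class BenchmarkTable {
        private final Map<String, Map<String, Long>> bestNanos = new HashMap<>();

        /** Record one observed execution time of a program on a node. */
        public void record(String program, String node, long nanos) {
            bestNanos.computeIfAbsent(program, p -> new HashMap<>())
                     .merge(node, nanos, Math::min); // keep the shortest observed duration
        }

        /** The node where the program finished in the shortest observed time, if any. */
        public Optional<String> mostSuitableNode(String program) {
            Map<String, Long> perNode = bestNanos.getOrDefault(program, Collections.emptyMap());
            return perNode.entrySet().stream()
                    .min(Map.Entry.comparingByValue())
                    .map(Map.Entry::getKey);
        }
    }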


R-Storm: A Resource-Aware Scheduler for STORM. Mohammad Hosseini Boyang Peng Zhihao Hong Reza Farivar Roy Campbell R-Storm: A Resource-Aware Scheduler for STORM Mohammad Hosseini Boyang Peng Zhihao Hong Reza Farivar Roy Campbell Introduction STORM is an open source distributed real-time data stream processing system

More information

ADAPTIVE AND DYNAMIC LOAD BALANCING METHODOLOGIES FOR DISTRIBUTED ENVIRONMENT

ADAPTIVE AND DYNAMIC LOAD BALANCING METHODOLOGIES FOR DISTRIBUTED ENVIRONMENT ADAPTIVE AND DYNAMIC LOAD BALANCING METHODOLOGIES FOR DISTRIBUTED ENVIRONMENT PhD Summary DOCTORATE OF PHILOSOPHY IN COMPUTER SCIENCE & ENGINEERING By Sandip Kumar Goyal (09-PhD-052) Under the Supervision

More information

Apache Storm. Hortonworks Inc Page 1

Apache Storm. Hortonworks Inc Page 1 Apache Storm Page 1 What is Storm? Real time stream processing framework Scalable Up to 1 million tuples per second per node Fault Tolerant Tasks reassigned on failure Guaranteed Processing At least once

More information

Chapter 5. The MapReduce Programming Model and Implementation

Chapter 5. The MapReduce Programming Model and Implementation Chapter 5. The MapReduce Programming Model and Implementation - Traditional computing: data-to-computing (send data to computing) * Data stored in separate repository * Data brought into system for computing

More information

HiTune. Dataflow-Based Performance Analysis for Big Data Cloud

HiTune. Dataflow-Based Performance Analysis for Big Data Cloud HiTune Dataflow-Based Performance Analysis for Big Data Cloud Jinquan (Jason) Dai, Jie Huang, Shengsheng Huang, Bo Huang, Yan Liu Intel Asia-Pacific Research and Development Ltd Shanghai, China, 200241

More information

FuxiSort. Jiamang Wang, Yongjun Wu, Hua Cai, Zhipeng Tang, Zhiqiang Lv, Bin Lu, Yangyu Tao, Chao Li, Jingren Zhou, Hong Tang Alibaba Group Inc

FuxiSort. Jiamang Wang, Yongjun Wu, Hua Cai, Zhipeng Tang, Zhiqiang Lv, Bin Lu, Yangyu Tao, Chao Li, Jingren Zhou, Hong Tang Alibaba Group Inc Fuxi Jiamang Wang, Yongjun Wu, Hua Cai, Zhipeng Tang, Zhiqiang Lv, Bin Lu, Yangyu Tao, Chao Li, Jingren Zhou, Hong Tang Alibaba Group Inc {jiamang.wang, yongjun.wyj, hua.caihua, zhipeng.tzp, zhiqiang.lv,

More information

CHAPTER 6 ENERGY AWARE SCHEDULING ALGORITHMS IN CLOUD ENVIRONMENT

CHAPTER 6 ENERGY AWARE SCHEDULING ALGORITHMS IN CLOUD ENVIRONMENT CHAPTER 6 ENERGY AWARE SCHEDULING ALGORITHMS IN CLOUD ENVIRONMENT This chapter discusses software based scheduling and testing. DVFS (Dynamic Voltage and Frequency Scaling) [42] based experiments have

More information

Advanced Databases: Parallel Databases A.Poulovassilis

Advanced Databases: Parallel Databases A.Poulovassilis 1 Advanced Databases: Parallel Databases A.Poulovassilis 1 Parallel Database Architectures Parallel database systems use parallel processing techniques to achieve faster DBMS performance and handle larger

More information

MOHA: Many-Task Computing Framework on Hadoop

MOHA: Many-Task Computing Framework on Hadoop Apache: Big Data North America 2017 @ Miami MOHA: Many-Task Computing Framework on Hadoop Soonwook Hwang Korea Institute of Science and Technology Information May 18, 2017 Table of Contents Introduction

More information

Write a technical report Present your results Write a workshop/conference paper (optional) Could be a real system, simulation and/or theoretical

Write a technical report Present your results Write a workshop/conference paper (optional) Could be a real system, simulation and/or theoretical Identify a problem Review approaches to the problem Propose a novel approach to the problem Define, design, prototype an implementation to evaluate your approach Could be a real system, simulation and/or

More information

FedX: A Federation Layer for Distributed Query Processing on Linked Open Data

FedX: A Federation Layer for Distributed Query Processing on Linked Open Data FedX: A Federation Layer for Distributed Query Processing on Linked Open Data Andreas Schwarte 1, Peter Haase 1,KatjaHose 2, Ralf Schenkel 2, and Michael Schmidt 1 1 fluid Operations AG, Walldorf, Germany

More information

Cloud Computing CS

Cloud Computing CS Cloud Computing CS 15-319 Programming Models- Part III Lecture 6, Feb 1, 2012 Majd F. Sakr and Mohammad Hammoud 1 Today Last session Programming Models- Part II Today s session Programming Models Part

More information

Fusion iomemory PCIe Solutions from SanDisk and Sqrll make Accumulo Hypersonic

Fusion iomemory PCIe Solutions from SanDisk and Sqrll make Accumulo Hypersonic WHITE PAPER Fusion iomemory PCIe Solutions from SanDisk and Sqrll make Accumulo Hypersonic Western Digital Technologies, Inc. 951 SanDisk Drive, Milpitas, CA 95035 www.sandisk.com Table of Contents Executive

More information

SEDA: An Architecture for Well-Conditioned, Scalable Internet Services

SEDA: An Architecture for Well-Conditioned, Scalable Internet Services SEDA: An Architecture for Well-Conditioned, Scalable Internet Services Matt Welsh, David Culler, and Eric Brewer Computer Science Division University of California, Berkeley Operating Systems Principles

More information

Schema-Agnostic Indexing with Azure Document DB

Schema-Agnostic Indexing with Azure Document DB Schema-Agnostic Indexing with Azure Document DB Introduction Azure DocumentDB is Microsoft s multi-tenant distributed database service for managing JSON documents at Internet scale Multi-tenancy is an

More information

Big Data Infrastructures & Technologies

Big Data Infrastructures & Technologies Big Data Infrastructures & Technologies Data streams and low latency processing DATA STREAM BASICS What is a data stream? Large data volume, likely structured, arriving at a very high rate Potentially

More information

FROM LEGACY, TO BATCH, TO NEAR REAL-TIME. Marc Sturlese, Dani Solà

FROM LEGACY, TO BATCH, TO NEAR REAL-TIME. Marc Sturlese, Dani Solà FROM LEGACY, TO BATCH, TO NEAR REAL-TIME Marc Sturlese, Dani Solà WHO ARE WE? Marc Sturlese - @sturlese Backend engineer, focused on R&D Interests: search, scalability Dani Solà - @dani_sola Backend engineer

More information

Storage Networking Strategy for the Next Five Years

Storage Networking Strategy for the Next Five Years White Paper Storage Networking Strategy for the Next Five Years 2018 Cisco and/or its affiliates. All rights reserved. This document is Cisco Public Information. Page 1 of 8 Top considerations for storage

More information

Solace JMS Broker Delivers Highest Throughput for Persistent and Non-Persistent Delivery

Solace JMS Broker Delivers Highest Throughput for Persistent and Non-Persistent Delivery Solace JMS Broker Delivers Highest Throughput for Persistent and Non-Persistent Delivery Java Message Service (JMS) is a standardized messaging interface that has become a pervasive part of the IT landscape

More information

OpenCache. A Platform for Efficient Video Delivery. Matthew Broadbent. 1 st Year PhD Student

OpenCache. A Platform for Efficient Video Delivery. Matthew Broadbent. 1 st Year PhD Student OpenCache A Platform for Efficient Video Delivery Matthew Broadbent 1 st Year PhD Student Motivation Consumption of video content on the Internet is constantly expanding Video-on-demand is an ever greater

More information

Optimizing Parallel Access to the BaBar Database System Using CORBA Servers

Optimizing Parallel Access to the BaBar Database System Using CORBA Servers SLAC-PUB-9176 September 2001 Optimizing Parallel Access to the BaBar Database System Using CORBA Servers Jacek Becla 1, Igor Gaponenko 2 1 Stanford Linear Accelerator Center Stanford University, Stanford,

More information

Best Practices for Setting BIOS Parameters for Performance

Best Practices for Setting BIOS Parameters for Performance White Paper Best Practices for Setting BIOS Parameters for Performance Cisco UCS E5-based M3 Servers May 2013 2014 Cisco and/or its affiliates. All rights reserved. This document is Cisco Public. Page

More information

High Throughput WAN Data Transfer with Hadoop-based Storage

High Throughput WAN Data Transfer with Hadoop-based Storage High Throughput WAN Data Transfer with Hadoop-based Storage A Amin 2, B Bockelman 4, J Letts 1, T Levshina 3, T Martin 1, H Pi 1, I Sfiligoi 1, M Thomas 2, F Wuerthwein 1 1 University of California, San

More information

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros Data Clustering on the Parallel Hadoop MapReduce Model Dimitrios Verraros Overview The purpose of this thesis is to implement and benchmark the performance of a parallel K- means clustering algorithm on

More information

D6.1 Integration Report

D6.1 Integration Report Project Number 611411 D6.1 Integration Report Version 1.0 Final Public Distribution Bosch with contributions from all partners Project Partners: aicas, Bosch, CNRS, Rheon Media, The Open Group, University

More information

Mark Sandstrom ThroughPuter, Inc.

Mark Sandstrom ThroughPuter, Inc. Hardware Implemented Scheduler, Placer, Inter-Task Communications and IO System Functions for Many Processors Dynamically Shared among Multiple Applications Mark Sandstrom ThroughPuter, Inc mark@throughputercom

More information

Grid Computing Systems: A Survey and Taxonomy

Grid Computing Systems: A Survey and Taxonomy Grid Computing Systems: A Survey and Taxonomy Material for this lecture from: A Survey and Taxonomy of Resource Management Systems for Grid Computing Systems, K. Krauter, R. Buyya, M. Maheswaran, CS Technical

More information

Priority Based Resource Scheduling Techniques for a Multitenant Stream Processing Platform

Priority Based Resource Scheduling Techniques for a Multitenant Stream Processing Platform Priority Based Resource Scheduling Techniques for a Multitenant Stream Processing Platform By Rudraneel Chakraborty A thesis submitted to the Faculty of Graduate and Postdoctoral Affairs in partial fulfillment

More information

8/24/2017 Week 1-B Instructor: Sangmi Lee Pallickara

8/24/2017 Week 1-B Instructor: Sangmi Lee Pallickara Week 1-B-0 Week 1-B-1 CS535 BIG DATA FAQs Slides are available on the course web Wait list Term project topics PART 0. INTRODUCTION 2. DATA PROCESSING PARADIGMS FOR BIG DATA Sangmi Lee Pallickara Computer

More information

Data Partitioning and MapReduce

Data Partitioning and MapReduce Data Partitioning and MapReduce Krzysztof Dembczyński Intelligent Decision Support Systems Laboratory (IDSS) Poznań University of Technology, Poland Intelligent Decision Support Systems Master studies,

More information

Improving the MapReduce Big Data Processing Framework

Improving the MapReduce Big Data Processing Framework Improving the MapReduce Big Data Processing Framework Gistau, Reza Akbarinia, Patrick Valduriez INRIA & LIRMM, Montpellier, France In collaboration with Divyakant Agrawal, UCSB Esther Pacitti, UM2, LIRMM

More information

Big Data Using Hadoop

Big Data Using Hadoop IEEE 2016-17 PROJECT LIST(JAVA) Big Data Using Hadoop 17ANSP-BD-001 17ANSP-BD-002 Hadoop Performance Modeling for JobEstimation and Resource Provisioning MapReduce has become a major computing model for

More information

RAMCloud. Scalable High-Performance Storage Entirely in DRAM. by John Ousterhout et al. Stanford University. presented by Slavik Derevyanko

RAMCloud. Scalable High-Performance Storage Entirely in DRAM. by John Ousterhout et al. Stanford University. presented by Slavik Derevyanko RAMCloud Scalable High-Performance Storage Entirely in DRAM 2009 by John Ousterhout et al. Stanford University presented by Slavik Derevyanko Outline RAMCloud project overview Motivation for RAMCloud storage:

More information

Welcome to the New Era of Cloud Computing

Welcome to the New Era of Cloud Computing Welcome to the New Era of Cloud Computing Aaron Kimball The web is replacing the desktop 1 SDKs & toolkits are there What about the backend? Image: Wikipedia user Calyponte 2 Two key concepts Processing

More information

Data Model Considerations for Radar Systems

Data Model Considerations for Radar Systems WHITEPAPER Data Model Considerations for Radar Systems Executive Summary The market demands that today s radar systems be designed to keep up with a rapidly changing threat environment, adapt to new technologies,

More information

1 Publishable Summary

1 Publishable Summary 1 Publishable Summary 1.1 VELOX Motivation and Goals The current trend in designing processors with multiple cores, where cores operate in parallel and each of them supports multiple threads, makes the

More information

B.H.GARDI COLLEGE OF ENGINEERING & TECHNOLOGY (MCA Dept.) Parallel Database Database Management System - 2

B.H.GARDI COLLEGE OF ENGINEERING & TECHNOLOGY (MCA Dept.) Parallel Database Database Management System - 2 Introduction :- Today single CPU based architecture is not capable enough for the modern database that are required to handle more demanding and complex requirements of the users, for example, high performance,

More information

Interactive Analysis of Large Distributed Systems with Scalable Topology-based Visualization

Interactive Analysis of Large Distributed Systems with Scalable Topology-based Visualization Interactive Analysis of Large Distributed Systems with Scalable Topology-based Visualization Lucas M. Schnorr, Arnaud Legrand, and Jean-Marc Vincent e-mail : Firstname.Lastname@imag.fr Laboratoire d Informatique

More information

New research on Key Technologies of unstructured data cloud storage

New research on Key Technologies of unstructured data cloud storage 2017 International Conference on Computing, Communications and Automation(I3CA 2017) New research on Key Technologies of unstructured data cloud storage Songqi Peng, Rengkui Liua, *, Futian Wang State

More information

Tag Switching. Background. Tag-Switching Architecture. Forwarding Component CHAPTER

Tag Switching. Background. Tag-Switching Architecture. Forwarding Component CHAPTER CHAPTER 23 Tag Switching Background Rapid changes in the type (and quantity) of traffic handled by the Internet and the explosion in the number of Internet users is putting an unprecedented strain on the

More information

Hardware Assisted Recursive Packet Classification Module for IPv6 etworks ABSTRACT

Hardware Assisted Recursive Packet Classification Module for IPv6 etworks ABSTRACT Hardware Assisted Recursive Packet Classification Module for IPv6 etworks Shivvasangari Subramani [shivva1@umbc.edu] Department of Computer Science and Electrical Engineering University of Maryland Baltimore

More information

Real-time Scheduling of Skewed MapReduce Jobs in Heterogeneous Environments

Real-time Scheduling of Skewed MapReduce Jobs in Heterogeneous Environments Real-time Scheduling of Skewed MapReduce Jobs in Heterogeneous Environments Nikos Zacheilas, Vana Kalogeraki Department of Informatics Athens University of Economics and Business 1 Big Data era has arrived!

More information

A RESOURCE AWARE SOFTWARE ARCHITECTURE FEATURING DEVICE SYNCHRONIZATION AND FAULT TOLERANCE

A RESOURCE AWARE SOFTWARE ARCHITECTURE FEATURING DEVICE SYNCHRONIZATION AND FAULT TOLERANCE A RESOURCE AWARE SOFTWARE ARCHITECTURE FEATURING DEVICE SYNCHRONIZATION AND FAULT TOLERANCE Chris Mattmann University of Southern California University Park Campus, Los Angeles, CA 90007 mattmann@usc.edu

More information

IBM InfoSphere Streams v4.0 Performance Best Practices

IBM InfoSphere Streams v4.0 Performance Best Practices Henry May IBM InfoSphere Streams v4.0 Performance Best Practices Abstract Streams v4.0 introduces powerful high availability features. Leveraging these requires careful consideration of performance related

More information

TUTORIAL: WHITE PAPER. VERITAS Indepth for the J2EE Platform PERFORMANCE MANAGEMENT FOR J2EE APPLICATIONS

TUTORIAL: WHITE PAPER. VERITAS Indepth for the J2EE Platform PERFORMANCE MANAGEMENT FOR J2EE APPLICATIONS TUTORIAL: WHITE PAPER VERITAS Indepth for the J2EE Platform PERFORMANCE MANAGEMENT FOR J2EE APPLICATIONS 1 1. Introduction The Critical Mid-Tier... 3 2. Performance Challenges of J2EE Applications... 3

More information

Analysis of Sorting as a Streaming Application

Analysis of Sorting as a Streaming Application 1 of 10 Analysis of Sorting as a Streaming Application Greg Galloway, ggalloway@wustl.edu (A class project report written under the guidance of Prof. Raj Jain) Download Abstract Expressing concurrency

More information

Nowadays data-intensive applications play a

Nowadays data-intensive applications play a Journal of Advances in Computer Engineering and Technology, 3(2) 2017 Data Replication-Based Scheduling in Cloud Computing Environment Bahareh Rahmati 1, Amir Masoud Rahmani 2 Received (2016-02-02) Accepted

More information

Putting it together. Data-Parallel Computation. Ex: Word count using partial aggregation. Big Data Processing. COS 418: Distributed Systems Lecture 21

Putting it together. Data-Parallel Computation. Ex: Word count using partial aggregation. Big Data Processing. COS 418: Distributed Systems Lecture 21 Big Processing -Parallel Computation COS 418: Distributed Systems Lecture 21 Michael Freedman 2 Ex: Word count using partial aggregation Putting it together 1. Compute word counts from individual files

More information

Distributed Data Infrastructures, Fall 2017, Chapter 2. Jussi Kangasharju

Distributed Data Infrastructures, Fall 2017, Chapter 2. Jussi Kangasharju Distributed Data Infrastructures, Fall 2017, Chapter 2 Jussi Kangasharju Chapter Outline Warehouse-scale computing overview Workloads and software infrastructure Failures and repairs Note: Term Warehouse-scale

More information

Lecture 11 Hadoop & Spark

Lecture 11 Hadoop & Spark Lecture 11 Hadoop & Spark Dr. Wilson Rivera ICOM 6025: High Performance Computing Electrical and Computer Engineering Department University of Puerto Rico Outline Distributed File Systems Hadoop Ecosystem

More information

HANA Performance. Efficient Speed and Scale-out for Real-time BI

HANA Performance. Efficient Speed and Scale-out for Real-time BI HANA Performance Efficient Speed and Scale-out for Real-time BI 1 HANA Performance: Efficient Speed and Scale-out for Real-time BI Introduction SAP HANA enables organizations to optimize their business

More information

Storm. Distributed and fault-tolerant realtime computation. Nathan Marz Twitter

Storm. Distributed and fault-tolerant realtime computation. Nathan Marz Twitter Storm Distributed and fault-tolerant realtime computation Nathan Marz Twitter Storm at Twitter Twitter Web Analytics Before Storm Queues Workers Example (simplified) Example Workers schemify tweets and

More information

A BigData Tour HDFS, Ceph and MapReduce

A BigData Tour HDFS, Ceph and MapReduce A BigData Tour HDFS, Ceph and MapReduce These slides are possible thanks to these sources Jonathan Drusi - SCInet Toronto Hadoop Tutorial, Amir Payberah - Course in Data Intensive Computing SICS; Yahoo!

More information

Introduction to Hadoop. Owen O Malley Yahoo!, Grid Team

Introduction to Hadoop. Owen O Malley Yahoo!, Grid Team Introduction to Hadoop Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com Who Am I? Yahoo! Architect on Hadoop Map/Reduce Design, review, and implement features in Hadoop Working on Hadoop full time since

More information

Optimizing Apache Spark with Memory1. July Page 1 of 14

Optimizing Apache Spark with Memory1. July Page 1 of 14 Optimizing Apache Spark with Memory1 July 2016 Page 1 of 14 Abstract The prevalence of Big Data is driving increasing demand for real -time analysis and insight. Big data processing platforms, like Apache

More information

OVERHEADS ENHANCEMENT IN MUTIPLE PROCESSING SYSTEMS BY ANURAG REDDY GANKAT KARTHIK REDDY AKKATI

OVERHEADS ENHANCEMENT IN MUTIPLE PROCESSING SYSTEMS BY ANURAG REDDY GANKAT KARTHIK REDDY AKKATI CMPE 655- MULTIPLE PROCESSOR SYSTEMS OVERHEADS ENHANCEMENT IN MUTIPLE PROCESSING SYSTEMS BY ANURAG REDDY GANKAT KARTHIK REDDY AKKATI What is MULTI PROCESSING?? Multiprocessing is the coordinated processing

More information

HPC Considerations for Scalable Multidiscipline CAE Applications on Conventional Linux Platforms. Author: Correspondence: ABSTRACT:

HPC Considerations for Scalable Multidiscipline CAE Applications on Conventional Linux Platforms. Author: Correspondence: ABSTRACT: HPC Considerations for Scalable Multidiscipline CAE Applications on Conventional Linux Platforms Author: Stan Posey Panasas, Inc. Correspondence: Stan Posey Panasas, Inc. Phone +510 608 4383 Email sposey@panasas.com

More information

Flying Faster with Heron

Flying Faster with Heron Flying Faster with Heron KARTHIK RAMASAMY @KARTHIKZ #TwitterHeron TALK OUTLINE BEGIN I! II ( III b OVERVIEW MOTIVATION HERON IV Z OPERATIONAL EXPERIENCES V K HERON PERFORMANCE END [! OVERVIEW TWITTER IS

More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff and Shun Tak Leung Google* Shivesh Kumar Sharma fl4164@wayne.edu Fall 2015 004395771 Overview Google file system is a scalable distributed file system

More information

Big Data Programming: an Introduction. Spring 2015, X. Zhang Fordham Univ.

Big Data Programming: an Introduction. Spring 2015, X. Zhang Fordham Univ. Big Data Programming: an Introduction Spring 2015, X. Zhang Fordham Univ. Outline What the course is about? scope Introduction to big data programming Opportunity and challenge of big data Origin of Hadoop

More information

MI-PDB, MIE-PDB: Advanced Database Systems

MI-PDB, MIE-PDB: Advanced Database Systems MI-PDB, MIE-PDB: Advanced Database Systems http://www.ksi.mff.cuni.cz/~svoboda/courses/2015-2-mie-pdb/ Lecture 10: MapReduce, Hadoop 26. 4. 2016 Lecturer: Martin Svoboda svoboda@ksi.mff.cuni.cz Author:

More information

Comprehensive Guide to Evaluating Event Stream Processing Engines

Comprehensive Guide to Evaluating Event Stream Processing Engines Comprehensive Guide to Evaluating Event Stream Processing Engines i Copyright 2006 Coral8, Inc. All rights reserved worldwide. Worldwide Headquarters: Coral8, Inc. 82 Pioneer Way, Suite 106 Mountain View,

More information

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce Parallel Programming Principle and Practice Lecture 10 Big Data Processing with MapReduce Outline MapReduce Programming Model MapReduce Examples Hadoop 2 Incredible Things That Happen Every Minute On The

More information

Databases 2 (VU) ( / )

Databases 2 (VU) ( / ) Databases 2 (VU) (706.711 / 707.030) MapReduce (Part 3) Mark Kröll ISDS, TU Graz Nov. 27, 2017 Mark Kröll (ISDS, TU Graz) MapReduce Nov. 27, 2017 1 / 42 Outline 1 Problems Suited for Map-Reduce 2 MapReduce:

More information

Catalogic DPX TM 4.3. ECX 2.0 Best Practices for Deployment and Cataloging

Catalogic DPX TM 4.3. ECX 2.0 Best Practices for Deployment and Cataloging Catalogic DPX TM 4.3 ECX 2.0 Best Practices for Deployment and Cataloging 1 Catalogic Software, Inc TM, 2015. All rights reserved. This publication contains proprietary and confidential material, and is

More information

Hadoop Virtualization Extensions on VMware vsphere 5 T E C H N I C A L W H I T E P A P E R

Hadoop Virtualization Extensions on VMware vsphere 5 T E C H N I C A L W H I T E P A P E R Hadoop Virtualization Extensions on VMware vsphere 5 T E C H N I C A L W H I T E P A P E R Table of Contents Introduction... 3 Topology Awareness in Hadoop... 3 Virtual Hadoop... 4 HVE Solution... 5 Architecture...

More information

An Introduction to Big Data Formats

An Introduction to Big Data Formats Introduction to Big Data Formats 1 An Introduction to Big Data Formats Understanding Avro, Parquet, and ORC WHITE PAPER Introduction to Big Data Formats 2 TABLE OF TABLE OF CONTENTS CONTENTS INTRODUCTION

More information

02 - Distributed Systems

02 - Distributed Systems 02 - Distributed Systems Definition Coulouris 1 (Dis)advantages Coulouris 2 Challenges Saltzer_84.pdf Models Physical Architectural Fundamental 2/60 Definition Distributed Systems Distributed System is

More information

The Design and Implementation of a Next Generation Name Service for the Internet (CoDoNS) Presented By: Kamalakar Kambhatla

The Design and Implementation of a Next Generation Name Service for the Internet (CoDoNS) Presented By: Kamalakar Kambhatla The Design and Implementation of a Next Generation Name Service for the Internet (CoDoNS) Venugopalan Ramasubramanian Emin Gün Sirer Presented By: Kamalakar Kambhatla * Slides adapted from the paper -

More information

Towards Modeling Approach Enabling Efficient Platform for Heterogeneous Big Data Analysis.

Towards Modeling Approach Enabling Efficient Platform for Heterogeneous Big Data Analysis. Towards Modeling Approach Enabling Efficient Platform for Heterogeneous Big Data Analysis Andrey.Sadovykh@softeam.fr www.modeliosoft.com 1 Outlines Introduction Model-driven development Big Data Juniper

More information

ADAPTIVE HANDLING OF 3V S OF BIG DATA TO IMPROVE EFFICIENCY USING HETEROGENEOUS CLUSTERS

ADAPTIVE HANDLING OF 3V S OF BIG DATA TO IMPROVE EFFICIENCY USING HETEROGENEOUS CLUSTERS INTERNATIONAL JOURNAL OF RESEARCH IN COMPUTER APPLICATIONS AND ROBOTICS ISSN 2320-7345 ADAPTIVE HANDLING OF 3V S OF BIG DATA TO IMPROVE EFFICIENCY USING HETEROGENEOUS CLUSTERS Radhakrishnan R 1, Karthik

More information