D3.8 First Prototype of the Real-time Scheduling Advisor


D3.8 First Prototype of the Real-time Scheduling Advisor

Project Number: 318763
Version: 1.0 Final
Distribution: EC
Lead Partner: Brno University of Technology
Project Partners: aicas, HMI, petafuel, SOFTEAM, Scuola Superiore Sant'Anna, The Open Group, University of Stuttgart, University of York

Every effort has been made to ensure that all statements and information contained herein are accurate; however, the JUNIPER Project Partners accept no liability for any error or omission in the same. Copyright in this document remains vested in the JUNIPER Project Partners.

Project Partner Contact Information

aicas
Fridtjof Siebert
Haid-und-Neue Strasse 18
Karlsruhe, Germany
siebert@aicas.com

HMI
Markus Schneider
Im Breitspiel 11 C
Heidelberg, Germany
schneider@hmi-tec.com

petafuel
Ludwig Adam
Muenchnerstrasse 4
Freising, Germany
ludwig.adam@petafuel.de

SOFTEAM
Andrey Sadovykh
Avenue Victor Hugo
Paris, France
andrey.sadovykh@softeam.fr

Scuola Superiore Sant'Anna
Mauro Marinoni
via Moruzzi 1
Pisa, Italy
m.marinoni@sssup.it

The Open Group
Scott Hansen
Avenue du Parc de Woluwe
1160 Brussels, Belgium
s.hansen@opengroup.org

University of Stuttgart
Bastian Koller
Nobelstrasse 19
Stuttgart, Germany
koller@hlrs.de

University of York
Neil Audsley
Deramore Lane
York YO10 5GH, United Kingdom
neil.audsley@cs.york.ac.uk

Brno University of Technology
Pavel Smrž
Božetěchova
Brno, Czech Republic
smrz@fit.vutbr.cz

Contents

1 Introduction
  1.1 Overview and Context
  1.2 Stream Processing on Heterogeneous Clusters
  1.3 Scheduling Decisions in Stream Processing on Heterogeneous Clusters
  1.4 Related Work
  1.5 Objectives and Structure of this Document
2 Scheduling in the JUNIPER Project
  2.1 Motivation
    2.1.1 Motivating Example
    2.1.2 Scheduling in the Motivating Example
  2.2 Integration Scenario
  2.3 Profiling and Benchmarking
    2.3.1 Profiling
    2.3.2 Benchmarking
3 The Scheduling Advisor
  3.1 Specification
  3.2 Inputs/Outputs and Required/Provided Interfaces
  3.3 Architecture and Prototype Implementation of the Scheduling Advisor
  3.4 Evaluation of the Scheduling Advisor Prototype
4 Conclusion and Future Work
  4.1 Summary and Conclusions
  4.2 Future Work

List of Figures

1 Architecture of the sample JUNIPER application: the graph of JUNIPER programs interconnected by communication channels
2 The non-optimal deployment of the sample JUNIPER application
3 The nearly optimal deployment of the sample JUNIPER application
4 The sequence of the individual steps in the development and deployment of a JUNIPER application, as described in the integration scenario
5 Architecture of the scheduling advisor
6 Scheduler performance comparison: cumulative results of multiple tests based on the number of tuples processed in a time interval

Document Control

Version  Status                                                      Date
0.1      Document outline                                            25 April
         Draft of integration scenario                               30 April
         Updated document outline after internal review by
         project partners                                            13 May
         Making the document ready for internal review               16 May
         First draft for internal review                             19 May
         Final modifications                                         26 May
         QA for EC delivery


Executive summary

This document constitutes deliverable 3.8, First Prototype of the Real-time Scheduling Advisor, of work package 3 of the JUNIPER project. The purpose of this deliverable is to describe the background research, specification, and architecture of the scheduling advisor, its integration into the JUNIPER project, and the implementation of its first prototype.

The scheduling advisor described in this deliverable will provide an architect, modeler, or developer of a JUNIPER application, or a system administrator of a JUNIPER platform, with information suitable for optimization of the application or the platform, respectively. More specifically, the scheduling advisor performs profiling and benchmarking of various deployments of the application on the platform in order to find the best possible deployment. It provides profiling results for the JUNIPER application and benchmarking results for its individual JUNIPER programs across the various deployments of the application (i.e., placements of the programs), measures the longest observed execution time of individual JUNIPER programs in all possible deployments to improve schedulability analysis, and finds the best possible deployment of the JUNIPER application on the particular platform, which indicates the optimal deployment for production use.

Deliverable 3.8, which is described in this document, is the first step towards the final version of the scheduling advisor, which will be described and submitted later in deliverable 3.9, in the third year of the project.

1 Introduction

1.1 Overview and Context

Stream processing [1] is a paradigm evolving in response to the well-known limitations of the widely adopted MapReduce paradigm for distributed Big Data batch processing, which is well recognized by the JUNIPER project in its deliverables. Efficient and real-time processing of large streaming data sources and stored data in the Big Data systems targeted by the JUNIPER project requires the underlying platform to provide applications with:

- high availability and optimal performance, to be able to process the continuously incoming data streams in real time and to fulfill the time constraints imposed by the applications;
- guaranteed resources, to be able to utilize the resources in the processing as necessary and without latencies; and
- high scalability, to be able to adapt to variability in the quality of the incoming data streams, to adapt to different resource availability in heterogeneous clusters, and to provide solutions scalable enough to cater for growing data stream rates in general.

[1] Although stream processing is the processing of continuous data streams, it can also easily process stored input data originally intended for batch processing, by sending the data as a stream of limited length.

Due to these requirements, the underlying platform has to employ advanced resource allocation and scheduling techniques to deploy the applications in an optimal way. Resource allocation deals mainly with the problem of gathering and assigning resources to the different requesters, while scheduling decides which tasks to place on which previously obtained resources, and when. In the JUNIPER project, the main effort in resource allocation and scheduling is concentrated in work package 3, where the system software required by the JUNIPER project is developed. The effort in work package 3 includes software support for micro-scheduling and resource reservation for processes at the operating system level, as well as for high-level scheduling and resource allocation for distributed tasks deployed to individual nodes across a large shared cluster, i.e., in a cloud environment.

In order to support hierarchical scheduling in the cloud settings and environments in which JUNIPER's intended applications operate, this deliverable discusses the design and prototype implementation of a novel real-time scheduling advisor, which is developed in task 3.5 of work package 3. The scheduling advisor will provide developers of JUNIPER applications with the information necessary to optimize resource allocation and scheduling of the applications in cloud environments. This includes information on the performance of various deployments including the worst-case deployment, profiling information on the applications and their individual components, benchmarking information on the deployment platforms running the applications, etc.

This document is part of deliverable 3.8 and describes the development of the first prototype of the scheduling advisor mentioned above. It is a first step towards the implementation of the final version of the scheduling advisor in future deliverable 3.9, to be produced in the third year of the project. The present document specifically focuses on the integration of the scheduling advisor into the JUNIPER project, by describing an integration scenario employing products of other deliverables and by defining the inputs and outputs of the advisor as well as its required and provided interfaces implemented by the other deliverables.

In the general context of the JUNIPER project, this deliverable is highly influenced by other deliverables, produced mainly by work packages 2, 3, and 5 during the first year. More specifically, the scheduling advisor utilizes the concept of the programming model presented in deliverable 2.1 and the representation of this programming model presented in deliverable 5.2, along with the description of schedulability analysis in deliverable 3.1. In the design of the scheduling advisor, we also take into account deliverables 2.2, 2.3, and 3.4, which discuss the optimization of Java applications by FPGAs and introduce high heterogeneity into distributed processing through the utilization of cluster nodes equipped with FPGAs for accelerated computation of particular tasks of a distributed application.

1.2 Stream Processing on Heterogeneous Clusters

In homogeneous computing environments, all nodes have identical performance and capacity. Resources can be allocated evenly across all available nodes, and effective task scheduling is determined by the quantity of the nodes, not by their individual quality. Typically, resource allocation and scheduling in homogeneous computing environments balance the workload across all the nodes, which should carry identical workloads.

Contrary to homogeneous computing environments, a heterogeneous cluster contains different types of nodes with various computing performance and capacity. High-performance nodes can complete the processing of identical data faster than low-performance nodes. Moreover, the performance of the nodes depends on the character of the computation and on the character of the input data. For example, graphics-intensive computations will run faster on nodes that are equipped with powerful GPUs, while memory-intensive computations will run faster on nodes with a large amount of RAM or disk space. To balance the workload in a heterogeneous cluster optimally, a scheduler has to (1) know the performance characteristics of the individual types of nodes employed in the cluster for different types of computations and (2) know, or be able to analyze, the computation characteristics of incoming tasks and input data.

The first requirement, i.e., the performance characteristics of the individual types of employed nodes, means awareness of the infrastructure and topology of a cluster, including a detailed specification of its individual nodes. In most cases, this information is provided at the cluster's design time by its administrators and architects. Moreover, the performance characteristics of the individual nodes employed in a cluster can be adjusted at the cluster's run time, based on historical performance-monitoring data and their statistical analysis over different types of computations and data processed by different types of nodes.

The second requirement is the knowledge, or the ability to analyze, the computation characteristics of incoming tasks and input data. In batch processing, tasks and data in a batch can be annotated or analyzed in advance, i.e., before the batch is executed, and the acquired knowledge can be utilized in optimal allocation of resources and efficient task scheduling. In stream processing, the second requirement is much more difficult to meet due to the continuous flow and unpredictable variability of the input data, which make a thorough analysis of the computation characteristics of the input data and incoming tasks impossible, especially under real-time limitations on their processing.

To address the above-mentioned issues of stream processing in heterogeneous clusters with optimal performance, user-defined tasks processing (at least some of) the input data have to help the scheduler. For example, an application may include user-defined helper tasks tagging input data at run time with their expected computation characteristics for better scheduling. [2] Moreover, individual tasks of a stream application should be tagged at design time according to their required computation resources and the real-time constraints on their processing, to help with their future scheduling. Implementation of this design-time tagging of tasks should be part of the modeling (a meta-model) of the topology and infrastructure of such applications, and it is described in deliverable 5.1 Foundations for MDE of Big Data Oriented Real-Time Systems.

[2] E.g., parts of variable-bit-rate video streams with temporarily high bit rate will be tagged for processing by special nodes with powerful video decoders, while average-bit-rate parts can be processed by common nodes.

With the knowledge of the performance characteristics of the individual types of nodes employed in a cluster, and with the knowledge, or the ability to analyze, the computation characteristics of incoming tasks and input data, a scheduler has enough information for balancing the workload of the cluster nodes and optimizing the throughput of an application. The related scheduling decisions, e.g., re-balancing of the workload, are usually made periodically with an optimal frequency: overly intensive re-balancing of the workload across the nodes can cause high overhead, while occasional re-balancing may not utilize all nodes optimally.

1.3 Scheduling Decisions in Stream Processing on Heterogeneous Clusters

Schedulers make their decisions on a particular level of abstraction. They do not try to schedule all live tasks to all possible cluster nodes, but deal with units of equal or scalable size. For example, the YARN scheduler [14] uses Containers with various amounts of cores and memory, and Apache Storm [2] uses Slots of equal size (one slot per CPU core) where, in each slot, multiple spouts or bolts of the same topology may run.

One of the important and commonly adopted scheduling decisions is data locality. For instance, the main idea of MapReduce is to perform computations on the nodes where the required data are stored, to prevent intensive loading of data into and removal from a cluster. Data-locality decisions from the stream processing perspective are different, because processors usually do not operate on data already stored in a processing cluster, but rather on streams coming from remote sources. Thus, in stream processing, we consider data locality to be an approach to minimizing communication costs, which results, for example, in scheduling the most-communicating processor instances together on the same node or the same rack.

The optimal placement of tasks across cluster nodes may, moreover, depend on requirements beyond the communication costs mentioned above. Typically, these concern CPU performance or overall node performance that makes the processing faster. For example, the performance optimization may lie in the detection of tasks which are exceedingly slow in comparison to others with the same signature. More sophisticated approaches are based on various kinds of benchmarks performed on each node in a cluster, where the placement of a task is decided with respect to its detected performance on a particular node or class of nodes. Furthermore, the presence of resources, e.g., GPUs or FPGAs, can be taken into account. A simplified sketch combining these placement considerations follows.
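To make the two requirements above concrete, a scheduler could combine a per-node-type performance profile (requirement 1) with measured inter-task traffic (the data-locality consideration) when scoring a candidate placement. The following Java sketch is a simplified illustration; the class and all names in it are assumptions made for this example, not an actual JUNIPER component:

    import java.util.*;

    /** Illustrative placement scoring: node-type performance plus communication cost. */
    public class PlacementScore {
        // Requirement (1): relative speed of each node type for each computation type,
        // provided at cluster design time or adjusted from monitoring history.
        private final Map<String, Map<String, Double>> speed = new HashMap<>();
        // Measured traffic (e.g., tuples per second) between pairs of tasks.
        private final Map<String, Map<String, Double>> traffic = new HashMap<>();

        public void setSpeed(String nodeType, String computationType, double s) {
            speed.computeIfAbsent(nodeType, n -> new HashMap<>()).put(computationType, s);
        }

        public void setTraffic(String taskA, String taskB, double rate) {
            traffic.computeIfAbsent(taskA, t -> new HashMap<>()).put(taskB, rate);
        }

        /**
         * Score an assignment of tasks (with known computation types) to node types:
         * higher node speed raises the score, while traffic between tasks placed on
         * different nodes lowers it (the data-locality consideration above).
         */
        public double score(Map<String, String> taskToNode, Map<String, String> taskToType) {
            double s = 0;
            for (Map.Entry<String, String> e : taskToNode.entrySet()) {
                String type = taskToType.get(e.getKey());
                s += speed.getOrDefault(e.getValue(), Collections.emptyMap())
                          .getOrDefault(type, 1.0);
            }
            for (Map.Entry<String, Map<String, Double>> e : traffic.entrySet())
                for (Map.Entry<String, Double> t : e.getValue().entrySet())
                    if (!Objects.equals(taskToNode.get(e.getKey()), taskToNode.get(t.getKey())))
                        s -= t.getValue(); // penalize remote communication
            return s;
        }
    }

A scheduler could evaluate such a score for each candidate placement and pick the highest-scoring one; real schedulers weigh the two terms and add further constraints.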

There are two essential kinds of scheduling decisions: off-line decisions and on-line decisions. The former are based on the knowledge the scheduler has before any task is placed and running. In the context of stream processing, this knowledge is mostly the topology, and off-line decisions can, for example, consider the communication channels between nodes. On-line decisions are made with information gathered during the actual execution of an application, i.e., after or during the initial placement of its tasks over a cluster's nodes. So the counterpart of the off-line, topology-based communication decision is a decision derived from the real bandwidths required between running processor instances [2]. In effect, most scheduling decisions in stream processing are made on-line, or are based on historical on-line data.

1.4 Related Work

Over the past decade, stream processing has been the subject of much research. Existing approaches can essentially be categorized by scalability into centralized, distributed, and massively-parallel stream processors. In this section, we will focus mainly on distributed and massively-parallel stream processors, but also on their successors exploiting ideas of the MapReduce paradigm in the context of stream processing.

For distributed stream processors, related work is mainly based on Aurora* [6], which has been introduced for scalable distributed processing of data streams. An Aurora* system is a set of Aurora* nodes that cooperate via an overlay network within the same administrative domain. The nodes can freely relocate load by decentralized, pairwise exchange of the Aurora* stream operators. Sites running Aurora* systems from different administrative domains can be integrated into a single federated system by Medusa [6]. Borealis [1] has introduced a refined QoS optimization model for Aurora*/Medusa where the effects of load shedding on QoS can be computed at every point in the data flow, which enables better strategies for load shedding.

Massively-parallel data processing systems, in contrast to the distributed (and also centralized) stream processors, have been designed to run on hundreds or even thousands of nodes and to efficiently transfer large data volumes between them. Traditionally, those systems have been used to process finite blocks of data stored on distributed file systems. However, newer systems such as Dryad [8], Hyracks [4], CIEL [12], DAGuE [5], or the Nephele framework [15] make it possible to assemble complex parallel data flow graphs and to construct pipelines between individual parts of the flow. Therefore, these parallel data flow systems in general are also suitable for streaming applications.

The latest related work is based mainly on the MapReduce paradigm or its concepts in the context of stream processing. First, Hadoop Online [7] extended the original Hadoop with the ability to stream intermediate results from map to reduce tasks, as well as the possibility to pipeline data across different MapReduce jobs. To facilitate these new features, the semantics of the classic reduce function has been extended with time-based sliding windows. Li et al. [10] picked up this idea and further improved the suitability of Hadoop-based systems for continuous streams by replacing the sort-merge implementation for partitioning with a new hash-based technique. The Muppet system [9], on the other hand, replaced the reduce function of MapReduce with a more generic and flexible update function.

S4 [13] and Apache Storm [11], which is used in this document, can also be classified as massively-parallel data processing systems with a clear emphasis on low latency. They are not based on MapReduce, but allow developers to assemble an arbitrarily complex directed acyclic graph (DAG) of processing tasks.

For example, Storm does not use intermediate queues to pass data items between tasks. Instead, data items are passed directly between the tasks, using batch messages on the network level to achieve a good balance between latency and throughput.

The distributed and massively-parallel stream processors mentioned above usually do not explicitly solve adaptive resource allocation and task scheduling in heterogeneous environments. For example, in [3], the authors demonstrate how Aurora*/Medusa handles time-varying load spikes and provides high availability in the face of network partitions. They concluded that Medusa with the Borealis extension does not distribute load optimally, but it guarantees acceptable allocations; i.e., either no participant operates above its capacity, or, if the system as a whole is overloaded, then all participants operate at or above capacity. Similar conclusions can be drawn for the previously mentioned massively-parallel data processing systems. For example, DAGuE does not target heterogeneous clusters which utilize commodity hardware nodes, but it can handle intra-node heterogeneity in clusters of supercomputers, where a runtime DAGuE scheduler decides at runtime which tasks to run on which resources [5]. Another already mentioned massively-parallel stream processing system, Dryad [8], is equipped with a robust scheduler which takes care of node liveness and the rescheduling of failed jobs, and which tracks the execution speed of different instances of each processor. When one of these instances under-performs the others, a new instance is scheduled in order to prevent slowdowns of the computation. The Dryad scheduler works in a greedy mode; it does not consider sharing of the cluster among multiple systems.

Finally, in the case of approaches based on the MapReduce paradigm or its concepts, resource allocation and scheduling of stream processing on heterogeneous clusters is necessary due to the utilization of commodity hardware nodes. In stream processing, data placement and distribution are given by a user-defined topology (e.g., by pipelines in Hadoop Online [7], or by a DAG of interconnected spouts and bolts in Apache Storm [11]). Therefore, approaches to adaptive resource allocation and scheduling have to address the initial distribution and the periodic re-balancing of the workload (i.e., tasks, not data) across nodes according to the different processing performance and specialization of the individual nodes in a heterogeneous cluster. For instance, S4 [13] uses Apache ZooKeeper to coordinate all operations and for communication between nodes. Initially, a user defines in ZooKeeper which nodes should be used for particular tasks of a computation. Then, S4 employs some nodes as backups for possible node failures and for load balancing. Adaptive scheduling in Apache Storm has been addressed in [2] by two generic schedulers that adapt their behavior according to the topology and the run-time communication pattern of an application. Experiments showed an improvement in the latency of event processing in comparison with the default Storm scheduler; however, the schedulers do not take into account the requirements discussed at the beginning of Section 1.2, i.e., the explicit knowledge of the performance characteristics of the individual types of nodes employed in a cluster for different types of computations, and the ability to analyze the computation characteristics of incoming tasks and input data. Implementing these requirements can further improve the efficiency of scheduling.

1.5 Objectives and Structure of this Document

The objectives of this document are based on the Description of Work for task 3.5 of the JUNIPER project and can be summarized as follows:

1. To analyze hierarchical scheduling approaches in cloud settings, dealing with the virtualized distributed systems that dominate the cloud environments in which JUNIPER's intended applications operate.
2. To extend the current scheduling approaches to address JUNIPER's focus on real-time in Big Data application contexts.
3. To design and implement a novel real-time scheduling advisor for JUNIPER applications.

Section 1 addresses the first objective and introduces the state of the art of scheduling of stream processing on heterogeneous clusters. Such processing is characteristic of the JUNIPER project, due to both the processing of Big Data streams and the heterogeneity of distributed processing clusters utilizing commodity hardware nodes as well as specialized nodes with FPGAs. Section 2 addresses the second objective and deals with the particular needs of the JUNIPER project in the scheduling of stream processing on heterogeneous clusters, which should be addressed by the scheduling advisor. Finally, Section 3 describes the first prototype of the scheduling advisor, its specification, architecture, design and implementation, and possible evaluation, to address the third objective. The document is concluded by Section 4, which sums up the contributions of deliverable 3.8 described in this document and outlines ongoing and future work towards the next deliverable 3.9, which will provide the final version of the scheduling advisor.

2 Scheduling in the JUNIPER Project

2.1 Motivation

According to deliverable 2.1, each JUNIPER application is composed of individual JUNIPER programs performing particular tasks. These JUNIPER programs are interconnected with the outside environment via input/output and request/response channels or streams, and with each other via communication channels. The responsibility and goal of a scheduler in the JUNIPER platform is to deploy the individual JUNIPER programs, or the JUNIPER application in general, in the optimal way, i.e., so that data processing by the JUNIPER application is the fastest (the JUNIPER application has the maximal performance).

Currently, there is no known scheduler for a distributed stream processing platform (see Section 1.4) which implements the desired features, such as support for the heterogeneity of a deployment platform/cluster, or the integration of profiling and benchmarking information into scheduling decisions. The scheduling advisor, which is described in this document, should improve scheduling decisions in the scheduler and provide useful advice to developers of JUNIPER applications, enabling them to optimize the application's performance and the efficient utilization of the resources required by the application.

2.1.1 Motivating Example

To demonstrate the scheduling problems anticipated in the current state of the art of stream processing on heterogeneous clusters, i.e., the problems which can be met in the JUNIPER project, a sample application is presented in this section. The application Popular Stories implements a use case of processing a continuous stream of web pages from thousands of RSS feeds. It analyzes the web pages in order to find texts and photos identifying the most popular connections between persons and related keywords and pictures. The result is a list of triples (a person's name, a list of keywords, and a set of photos), meaning that the person has recently been frequently mentioned in the context of the keywords (e.g., events, objects, persons, etc.) and the photos. The application holds a list of triples with the most often seen persons, keywords, and pictures for some period of time. This way, current trends of persons related to keywords, with relevant photos, can be obtained. [3]

[3] The application has been developed for demonstration purposes only, particularly to evaluate the scheduling advisor developed for the JUNIPER project and described in this document. However, the application may also be used in practice, e.g., for visualization of popular news on persons (with related keywords and photos) from news feeds on the Internet.

As a JUNIPER application, the sample application is composed of components which are JUNIPER programs, as depicted in Figure 1. The stream processing performed by the application starts with the URL generator program, which extracts URLs of web pages from RSS feeds arriving on an input channel. After that, the Downloader program gets the (X)HTML source, styles, and pictures of each web page and encapsulates them into a stand-alone message. The message is passed via a communication channel to the Analyzer program, which searches the web page for person names and for keywords and pictures in the context of the names found. The resulting name-keyword pairs are sent via a communication channel and stored in tops lists in the In-memory store NK program,

where the store is updated each time a new pair arrives and where excessively old pairs are removed from the computation of the list. In other words, a sliding time window is held for the tops list. All changes in the tops list are passed via a communication channel to the In-memory store NKP program. Moreover, name-picture pairs emitted by the Analyzer program are sent via a communication channel and processed by the Image feature extractor program to get indexable features of each image, which later allow the detection of different instances of the same picture (e.g., the same photo in a different resolution or with different cropping). Then, the image features are sent via a communication channel to the In-memory store NP program, where the tops list of the most popular persons and related unique images is held. Finally, all modifications in the tops list of the In-memory store NP program are emitted via a communication channel to the In-memory store NKP program, which maintains a consolidated tops list of persons with related keywords and pictures. This tops list is persistent, i.e., it is saved in the Persistent storage, and it is available for further querying via the input/output channels of the JUNIPER application.

Figure 1: Architecture of the sample JUNIPER application: the graph of JUNIPER programs interconnected by communication channels
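To illustrate the windowed tops-list maintenance performed by the In-memory store programs, the following minimal Java sketch keeps counts of name-keyword pairs within a sliding time window; the class and method names are illustrative and are not part of the JUNIPER API:

    import java.util.*;

    /**
     * Minimal sketch of a windowed "tops list" such as the one kept by the
     * In-memory store NK program: counts of name-keyword pairs are maintained
     * over a sliding time window, and old observations expire.
     */
    public class WindowedTopsList {
        private static final class Arrival {
            final long timestamp; final String key;
            Arrival(long timestamp, String key) { this.timestamp = timestamp; this.key = key; }
        }

        private final long windowMillis;
        private final Deque<Arrival> arrivals = new ArrayDeque<>();
        private final Map<String, Integer> counts = new HashMap<>();

        public WindowedTopsList(long windowMillis) { this.windowMillis = windowMillis; }

        /** Record one observed (name, keyword) pair and expire pairs older than the window. */
        public void add(String name, String keyword, long now) {
            String key = name + "|" + keyword;
            arrivals.addLast(new Arrival(now, key));
            counts.merge(key, 1, Integer::sum);
            while (!arrivals.isEmpty() && now - arrivals.peekFirst().timestamp > windowMillis) {
                Arrival old = arrivals.removeFirst();
                counts.computeIfPresent(old.key, (k, c) -> c == 1 ? null : c - 1);
            }
        }

        /** Return the k most frequent pairs currently inside the window. */
        public List<String> top(int k) {
            List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
            entries.sort((a, b) -> b.getValue() - a.getValue());
            List<String> result = new ArrayList<>();
            for (int i = 0; i < Math.min(k, entries.size()); i++) result.add(entries.get(i).getKey());
            return result;
        }
    }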

The sample application described above is designed as a JUNIPER application; however, it can also be implemented in other frameworks for distributed stream processing, e.g., as an Apache Storm application. Nevertheless, the application utilizes technologies of the Hadoop platform; e.g., the in-memory stores and the final persistent storage in the application employ the search engine Apache Lucene with the distributed Hadoop-based storage Katta for Lucene indexes, to detect different instances of the same pictures as mentioned above. For testing purposes, for an initial evaluation of different scheduling approaches, and finally also for the evaluation of the scheduling advisor described in this document, the sample application has initially been implemented in Apache Storm; a sketch of such an implementation follows. Results of the evaluation are presented later in this document.
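For concreteness, a Storm implementation of the topology in Figure 1 could be assembled with Storm's Java API roughly as follows; the spout/bolt class names, stream identifiers, and parallelism hints are assumptions made for this sketch and do not reflect the actual prototype code:

    import backtype.storm.Config;
    import backtype.storm.StormSubmitter;
    import backtype.storm.topology.TopologyBuilder;
    import backtype.storm.tuple.Fields;

    // Illustrative assembly of the Popular Stories topology; all spout/bolt
    // classes and stream ids are assumed for this sketch.
    public class PopularStoriesTopology {
        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            // RSS feeds enter through the URL generator spout.
            builder.setSpout("url-generator", new UrlGeneratorSpout(), 1);
            builder.setBolt("downloader", new DownloaderBolt(), 4)
                   .shuffleGrouping("url-generator");
            builder.setBolt("analyzer", new AnalyzerBolt(), 4)
                   .shuffleGrouping("downloader");
            // Name-keyword pairs go to the NK store, grouped by name so that one
            // store instance owns all counts for a given person.
            builder.setBolt("store-nk", new InMemoryStoreNkBolt(), 2)
                   .fieldsGrouping("analyzer", "name-keyword", new Fields("name"));
            // Name-picture pairs pass through the image feature extractor to the NP store.
            builder.setBolt("image-features", new ImageFeatureExtractorBolt(), 2)
                   .shuffleGrouping("analyzer", "name-picture");
            builder.setBolt("store-np", new InMemoryStoreNpBolt(), 2)
                   .fieldsGrouping("image-features", new Fields("name"));
            // The NKP store consolidates changes coming from both stores.
            builder.setBolt("store-nkp", new InMemoryStoreNkpBolt(), 1)
                   .globalGrouping("store-nk")
                   .globalGrouping("store-np");

            Config conf = new Config();
            conf.setNumWorkers(4);
            StormSubmitter.submitTopology("popular-stories", conf, builder.createTopology());
        }
    }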

2.1.2 Scheduling in the Motivating Example

When a JUNIPER application is deployed on a deployment platform, the individual JUNIPER programs of the application need to be instantiated, and their instances need to be executed on individual nodes of the target platform. This is the task of the scheduler of the target platform, which controls the instantiation of the JUNIPER programs and the placement of their instances on the available platform nodes, utilizing resources previously allocated to the JUNIPER application. In the case of the deployment of a JUNIPER application on a target platform that is a heterogeneous cluster, scheduling is not trivial and may result in non-optimal JUNIPER program instantiations and subsequent placement.

Let us have a heterogeneous cluster that consists of four nodes with different hardware configurations; more specifically, nodes with a fast CPU, with a slow CPU, with lots of memory, and equipped with GPUs. Figure 2 depicts a non-optimal scheduling of the sample application described in Section 2.1.1, caused by the lack of knowledge about the deployed application and the deployment target platform. After this scheduling, the Analyzer programs, which require CPU performance, were placed on the node with lots of memory, while the memory-greedy In-memory store programs were scheduled on the nodes with a powerful GPU and a slow CPU, which led to the need for a higher level of parallelism of MS NP. Moreover, the fast CPU node runs the undemanding Downloaders and the URL generator program. Finally, the Image feature extractor programs were placed on the slow CPU node and the high-memory node. Therefore, it is obvious that the scheduling decisions taken were wrong and that the scheduling resulted in inefficient utilization of the cluster.

Contrary to the scheduling described above, the scheduling of the sample JUNIPER application depicted in Figure 3 is nearly optimal. The In-memory store programs were deployed on the node with a high amount of memory, and the Image feature extractor programs were deployed on the node with two GPUs, so it was possible to reduce parallelism. The undemanding Downloader programs were placed on the slow CPU node, and the Analyzers utilize the fast CPU node.

(Dx: Downloader; Ax: Analyzer; IEx: Image feature extractor; MSx: In-memory store)
Figure 2: The non-optimal deployment of the sample JUNIPER application

2.2 Integration Scenario

The development and deployment of a JUNIPER application requires the cooperation of participants in different roles in the development process. Moreover, in the JUNIPER project, there are several specialized tools, namely the scheduling advisor and the schedulability analysis tools, which help the involved participants with the development and deployment. To describe the cooperation of the roles in the development process and the possible utilization of the mentioned tools, we define the following well-established roles:

- an analyst, who describes the required functionality of a JUNIPER application,
- an architect, who designs the application architecture according to the JUNIPER platform,
- a modeler, who describes the architecture and designs the application in detail according to the JUNIPER programming model,
- a developer, who implements the application utilizing the JUNIPER application interface (API),
- a system administrator, who deploys the application to a particular cluster running the JUNIPER platform.

The development process of a JUNIPER application, and the utilization of the JUNIPER tools in the development and deployment of the application by the above-mentioned roles, is depicted in Figure 4. The individual steps of the development process are described in the following integration scenario:

1. An analyst describes the specification of a JUNIPER application.

(Dx: Downloader; Ax: Analyzer; IEx: Image feature extractor; MSx: In-memory store)
Figure 3: The nearly optimal deployment of the sample JUNIPER application

2. An architect/modeler creates a bare model of the JUNIPER application (which is composed of JUNIPER programs). The bare model describes:
   - the logical architecture of the JUNIPER application (a graph of JUNIPER programs interconnected by communication channels and connected with the outside world by I/O channels and request/response streams, as described in deliverable 2.1),
   - the data structures (data types) for the communication channels, I/O channels, and request/response streams utilized by the JUNIPER application,
   - the expected data rates (frequencies) for processing data coming from the communication channels, I/O channels, and request/response streams utilized by the JUNIPER application,
   - the real-time constraints of the (real-time-)critical parts of the JUNIPER application, i.e., of subsets of all JUNIPER programs in the JUNIPER application.

3. A developer implements the JUNIPER application, i.e., he/she implements and integrates the individual JUNIPER programs of the JUNIPER application.

4. A system administrator defines a platform (a cluster, e.g., BonFIRE) where the JUNIPER application will eventually be deployed, including information about the (possible) allocation of resources within the platform and about the scheduling algorithms used within the single (cluster) nodes.

[Figure 4 (workflow diagram). Developer: Model application, Implement application, Run schedulability analysis, Define degree of parallelism. Administrator: Define platform. Scheduling advisor: Optimize deployment, Profile application, Deploy application. Numbers in black boxes refer to the numbered parts of the description.]
Figure 4: The sequence of the individual steps in the development and deployment of a JUNIPER application, as described in the integration scenario in this section

5. The developer, in cooperation with the system administrator, determines the longest observed execution time for each JUNIPER program of the JUNIPER application running on the defined platform, i.e., the execution time in the case of placement on the slowest platform node, when processing the most difficult input data, performing the slowest I/O operations, the most intensive resource allocations, etc.

6. The architect/modeler/developer uses the schedulability analysis tools to get information on the possible schedulability of the JUNIPER application. The schedulability analysis tools will process the following inputs:
   - the bare model, including the real-time constraints of the (real-time-)critical parts of the JUNIPER application, specified by the modeler in the second step of this scenario,
   - the information about the (possible) allocation of resources within the platform and about the scheduling algorithms used within the single (cluster) nodes, provided by the system administrator in the fourth step of this scenario,
   - the longest observed execution time of each JUNIPER program of the JUNIPER application, provided by the developer in cooperation with the system administrator in the fifth step of this scenario, or by the scheduling advisor, which can perform profiling of each

single JUNIPER program running alone (without any concurrency), as described later in step 13 of this scenario.

7. The schedulability analysis tools compute, from the inputs described in the previous step, the following results/outputs:
   - an upper bound on the response time of the overall stream processing in the cases when the real-time constraints specified in the inputs of the schedulability analysis are met, i.e., the response time of the JUNIPER application processing a particular piece of data from its input to its output,
   - an upper bound on the response time of each individual JUNIPER program of the JUNIPER application in the cases when the real-time constraints specified in the inputs of the schedulability analysis are met, i.e., the response time of each JUNIPER program processing a particular piece of data from its input to its output,
   - a sensitivity analysis, i.e., information on how to modify the design of the JUNIPER application and the allocation/scheduling phase of the JUNIPER programs to improve the real-time performance and response time.

8. The architect/modeler/developer takes into account the results of the schedulability analysis and performs the following activities to finalize the development of the JUNIPER application:
   - he/she concludes that the JUNIPER application should go into production, i.e., can successfully run on the defined platform, for the following reasons: (a) the upper bound on the response time [6] of the JUNIPER application meets its specification and its model defined at the beginning of this scenario, i.e., the JUNIPER application is able to meet the defined real-time constraints in each possible deployment, and (b) the JUNIPER application and its programs can be deployed in such a way/with such resource utilization that reduces the upper bound on the response times (better than the worst possible deployment; e.g., computation-intensive JUNIPER programs will utilize the nodes with the most powerful CPUs),
   - he/she re-designs the JUNIPER application according to the results of the sensitivity analysis from the previous step of this scenario, e.g., to reduce the response times or to improve the suitability of the JUNIPER application for eventual deployment on the defined platform,
   - he/she asks the system administrator to re-define the platform, e.g., to provide more powerful nodes and thereby eventually improve the longest observed execution times and also the response times of the JUNIPER programs and the JUNIPER application.

9. The developer defines a degree of parallelization for each JUNIPER program of the JUNIPER application as the number of instances which should be deployed for the program, based on the results of the schedulability analysis and the expected data rates (frequencies) of incoming data in the bare model; more instances of a JUNIPER program mean higher throughput in data processing, even in the case of a high upper bound on the response time of the JUNIPER program.

[6] The upper bound on the response time = finishing time − release time; more abstractly, also an upper bound on the budget used (i.e., the maximum amount of time allocated to the budget used; for budgets, see deliverable 3.1, which deals with scheduling and resource reservation techniques).
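Restated in symbols (the symbols $f$ for finishing time, $r$ for release time, and the index $k$ over observed executions are introduced here for illustration only, they are not notation from deliverable 3.1):

    \[ R = f - r, \qquad \overline{R} = \max_{k} \, (f_k - r_k) \]

i.e., the response time of one execution is the difference between its finishing and release times, and the reported upper bound is the maximum of this difference over all observed executions.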

10. The system administrator uses the scheduling advisor to deploy the JUNIPER application on the defined platform. For the deployment, the scheduling advisor takes into account the following inputs:
   - the bare model of the JUNIPER application (how to deploy), specified by the modeler in the second step of this scenario,
   - the implementation of the JUNIPER application (what to deploy), provided by the developer in the third step of this scenario,
   - the defined platform for the JUNIPER application (where to deploy), provided by the system administrator in the fourth step of this scenario,
   - the number of instances of the individual JUNIPER programs of the JUNIPER application, defined by the developer in the previous step of this scenario.

11. The scheduling advisor performs the deployment of the JUNIPER application in accordance with the inputs described in the previous step and with the following rule: instances of the JUNIPER programs are assigned to, and executed on, particular nodes primarily at random, but in such a way that instances of the same JUNIPER program are assigned to as many different types of nodes as possible (the goal is to profile/benchmark various deployments/combinations of the JUNIPER programs and the platform nodes; a simplified sketch of this rule follows step 12).

12. After the deployment of the JUNIPER application, the scheduling advisor will:
   - observe the performance of the individual JUNIPER programs and check whether all defined real-time constraints of the (real-time-)critical parts of the JUNIPER application are met,
   - concurrently perform both profiling of the JUNIPER programs and benchmarking of the platform nodes running the JUNIPER programs, to obtain information on the performance of particular JUNIPER programs on particular platform nodes,
   - perform redeployment of the JUNIPER application with various assignments of the JUNIPER programs to particular platform nodes, to get new profile/benchmark data on yet unobserved deployments; this redeployment can be done if and only if (a) the JUNIPER application can be re-executed from time to time (i.e., terminated and executed in the new deployment), which should be defined by the analyst in the first step of this scenario, and (b) redeployment is not disabled by the system administrator, e.g., in the case of a production run of the JUNIPER application where high availability is required,
   - start to optimize the deployment, i.e., to deploy the JUNIPER application in the best known way, as soon as (a) there are no new data obtainable by further profiling/benchmarking (e.g., all possible program-to-node deployments have been profiled/benchmarked), or (b) there is an immediate need for the most optimal deployment according to the current profiling and benchmarking data, e.g., in the case the JUNIPER application goes into production.
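A minimal sketch of the placement rule from step 11, with all names assumed for illustration: instances are placed randomly, but node types that have so far hosted fewer instances of the given program are preferred, so that profiling and benchmarking cover as many program-to-node-type combinations as possible.

    import java.util.*;

    /** Illustrative placement following step 11: random, but biased towards unseen node types. */
    public class CoveragePlacement {
        private final Random random = new Random();
        // How many instances of each program have been observed on each node type so far.
        private final Map<String, Map<String, Integer>> seen = new HashMap<>();

        /** Pick a node type for the next instance of the given program. */
        public String place(String program, List<String> nodeTypes) {
            Map<String, Integer> counts = seen.computeIfAbsent(program, p -> new HashMap<>());
            // Find the minimum observation count over the available node types.
            int min = Integer.MAX_VALUE;
            for (String type : nodeTypes) min = Math.min(min, counts.getOrDefault(type, 0));
            // Choose uniformly at random among the least-observed node types.
            List<String> candidates = new ArrayList<>();
            for (String type : nodeTypes)
                if (counts.getOrDefault(type, 0) == min) candidates.add(type);
            String chosen = candidates.get(random.nextInt(candidates.size()));
            counts.merge(chosen, 1, Integer::sum);
            return chosen;
        }
    }

The randomness keeps the placement unbiased, while the minimum-count preference steers successive (re)deployments towards combinations that have not yet been observed.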

13. The scheduling advisor will provide the data resulting from the profiling/benchmarking to the architect/modeler/developer/system administrator, who can utilize the outputs of the scheduling advisor together with the outputs of the schedulability analysis, e.g.,
   - to analyze the profiling and benchmarking results and specify more accurately the longest observed execution times of the JUNIPER application and its programs for the defined platform, and to use these times in the schedulability analysis tools in the sixth step of this scenario,
   - to redesign the JUNIPER application to be able to utilize the most powerful platform nodes according to the profiling and benchmarking results,
   - to optimize the deployment of the JUNIPER application on the defined platform by analysis of the profiling and benchmarking results (e.g., for classification of the platform nodes according to their suitability for individual JUNIPER programs), especially to select the optimal production deployment where any future experiments in re-deployment will not be possible,
   - to optimize the defined platform for the JUNIPER application, e.g., to increase the performance of the critical platform nodes according to the profiling and benchmarking results.

2.3 Profiling and Benchmarking

The proposed scheduling advisor targets off-line decisions derived from performance measurements, namely from the results of profiling a JUNIPER application in particular deployments on a defined JUNIPER platform, and from the results of benchmarking the JUNIPER platform with the resources available to the particular JUNIPER application. These two types of performance measurements eventually allow the scheduling advisor to find the optimal deployment of the JUNIPER application, as described in Section 2.2. Although both the profiling and the benchmarking are carried out for the same goal, and can be executed concurrently while running a JUNIPER application, they have to be performed by different means.

2.3.1 Profiling

The profiling of a JUNIPER application is the measurement of its behavior at various levels, i.e., the behavior of the whole JUNIPER application as well as the behavior of the individual JUNIPER programs, together with the measurement of resource utilization by the JUNIPER application, which is important for understanding the behavior and for evaluating how well the JUNIPER application performs on a deployment platform. Therefore, the profiling of a JUNIPER application requires the ability to observe and to measure the behavior and resource utilization of the application and its programs (the application components). The profiling results in a statistical summary of the observed events and the utilized resources, i.e., a profile, and in a sequence of recorded events, i.e., a trace (e.g., events which represent the resource reservation requests).

There are two approaches to observing and measuring the behavior and resource utilization of JUNIPER applications and JUNIPER programs:

1. non-intrusive profiling, which is implemented by the platform running the application, i.e., outside of the profiled application or program, which need not be modified for the profiling;
2. intrusive profiling, which is implemented in the profiled application or its programs by its developer (the application or the programs need to be modified for this type of profiling), without participation of the deployment platform.
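As a concrete illustration of the second (intrusive) approach, a developer might wrap a program's processing step with timing instructions. The following Java sketch uses only standard library calls; the class and the analyze step are hypothetical stand-ins for real JUNIPER program logic.

    /** Illustrative intrusive profiling: the developer adds timing code to a program's hot path. */
    public class InstrumentedAnalyzer {
        private long invocations = 0;
        private long totalNanos = 0;
        private long maxNanos = 0; // longest observed execution time of this step

        /** The instrumented processing step; analyze(page) stands for the real work. */
        public void process(String page) {
            long start = System.nanoTime();
            analyze(page);                      // the actual program logic
            long elapsed = System.nanoTime() - start;
            invocations++;
            totalNanos += elapsed;
            maxNanos = Math.max(maxNanos, elapsed);
        }

        private void analyze(String page) { /* application-specific work */ }

        /** Emit the collected profile, e.g., for the scheduling advisor to pick up. */
        public String profile() {
            return String.format("calls=%d avg=%dns max=%dns",
                    invocations, invocations == 0 ? 0 : totalNanos / invocations, maxNanos);
        }
    }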

The specific implementation of the above-mentioned profiling approaches in the JUNIPER project is determined by the implementation technology of JUNIPER applications, which are developed in Java, and by the implementation technology of the JUNIPER platform, which utilizes the Message Passing Interface (MPI), or more specifically, the Open MPI library (see deliverable 4.1 for both the Java and the Open MPI utilization). In the case of the non-intrusive profiling, Java provides the JVM Tool Interface (JVMTI) API, which enables a profiler to use hooks for trapping events like method calls, class loading and unloading, events generated by JVM garbage collection, events monitoring application threads, etc. These events will be generated by the JVMs running the individual JUNIPER programs on the platform nodes. Moreover, the Open MPI library can be modified to generate events on MPI operations called by the JUNIPER programs, transparently, without the programs being aware of it.

Contrary to the non-intrusive profiling, the intrusive profiling requires particular modifications of the profiled JUNIPER application and JUNIPER programs. A developer of the profiled application or program performs instrumentation, i.e., he/she adds to the application or program specific instructions which collect the required information. In comparison with the non-intrusive profiling, the intrusive profiling by instrumentation is done by the developer, not by the platform, and it has several disadvantages: it requires a special build of the profiled application or program with the instrumentation enabled; it can cause performance or even behavioral changes in the profiled application or program, due to the impact and overhead of the profiling instructions (i.e., of the instrumentation); it cannot be easily controlled by the deployment platform (e.g., to focus the profiling on particular parts of the application or program); etc.

2.3.2 Benchmarking

Benchmarking is the process of measuring the performance of the JUNIPER platform, i.e., of a deployment environment for JUNIPER applications. In the JUNIPER project, the benchmarking will be performed by the scheduling advisor, which is introduced in this document, by means of a JUNIPER application deployed on the benchmarked platform. The main goal of such benchmarking is to decide which platform node is the most suitable for a particular JUNIPER program of the JUNIPER application. The most suitable node is the node where the particular JUNIPER program finishes in the shortest time, in comparison with the duration of the same JUNIPER program performed by other platform nodes.

Besides the main goal mentioned above, the benchmarking in the JUNIPER project will discover performance characteristics of the individual nodes of the benchmarked platform, such as their processing speed, the performance of incoming and outgoing communication channels (latency, bandwidth, throughput, etc.), the cost per unit of computation (e.g., based on power consumption), etc. These performance characteristics do not directly determine the best deployment found before (e.g., in the case of a JUNIPER program which can be accelerated by a particular technology, such as a static-acceleration FPGA available on a particular node only, the best placement is driven by the accelerator, not by the computational performance of the node).
However, the discovered performance characteristics can be useful for tuning the JUNIPER application for the benchmarked platform by a developer of the application, or for tuning the platform itself by its system administrator (see the last step of the integration scenario in Section 2.2).
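A minimal sketch of the bookkeeping such benchmarking implies: for every (program, node) combination observed across the profiled deployments, the advisor records execution times and can then report the most suitable node per program. All names are illustrative assumptions, not the advisor's actual interface.

    import java.util.*;

    /** Illustrative benchmark record: execution times per (program, node) combination. */
    public class BenchmarkTable {
        private final Map<String, Map<String, Long>> bestNanos = new HashMap<>();

        /** Record one observed execution time of a program on a node. */
        public void record(String program, String node, long nanos) {
            bestNanos.computeIfAbsent(program, p -> new HashMap<>())
                     .merge(node, nanos, Math::min); // keep the shortest observed duration
        }

        /** The node where the program finished in the shortest observed time, if any. */
        public Optional<String> mostSuitableNode(String program) {
            Map<String, Long> perNode = bestNanos.getOrDefault(program, Collections.emptyMap());
            return perNode.entrySet().stream()
                    .min(Map.Entry.comparingByValue())
                    .map(Map.Entry::getKey);
        }
    }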


R-Storm: A Resource-Aware Scheduler for STORM. Mohammad Hosseini Boyang Peng Zhihao Hong Reza Farivar Roy Campbell R-Storm: A Resource-Aware Scheduler for STORM Mohammad Hosseini Boyang Peng Zhihao Hong Reza Farivar Roy Campbell Introduction STORM is an open source distributed real-time data stream processing system

More information

ADAPTIVE AND DYNAMIC LOAD BALANCING METHODOLOGIES FOR DISTRIBUTED ENVIRONMENT

ADAPTIVE AND DYNAMIC LOAD BALANCING METHODOLOGIES FOR DISTRIBUTED ENVIRONMENT ADAPTIVE AND DYNAMIC LOAD BALANCING METHODOLOGIES FOR DISTRIBUTED ENVIRONMENT PhD Summary DOCTORATE OF PHILOSOPHY IN COMPUTER SCIENCE & ENGINEERING By Sandip Kumar Goyal (09-PhD-052) Under the Supervision

More information

Apache Storm. Hortonworks Inc Page 1

Apache Storm. Hortonworks Inc Page 1 Apache Storm Page 1 What is Storm? Real time stream processing framework Scalable Up to 1 million tuples per second per node Fault Tolerant Tasks reassigned on failure Guaranteed Processing At least once

More information

Chapter 5. The MapReduce Programming Model and Implementation

Chapter 5. The MapReduce Programming Model and Implementation Chapter 5. The MapReduce Programming Model and Implementation - Traditional computing: data-to-computing (send data to computing) * Data stored in separate repository * Data brought into system for computing

More information

HiTune. Dataflow-Based Performance Analysis for Big Data Cloud

HiTune. Dataflow-Based Performance Analysis for Big Data Cloud HiTune Dataflow-Based Performance Analysis for Big Data Cloud Jinquan (Jason) Dai, Jie Huang, Shengsheng Huang, Bo Huang, Yan Liu Intel Asia-Pacific Research and Development Ltd Shanghai, China, 200241

More information

FuxiSort. Jiamang Wang, Yongjun Wu, Hua Cai, Zhipeng Tang, Zhiqiang Lv, Bin Lu, Yangyu Tao, Chao Li, Jingren Zhou, Hong Tang Alibaba Group Inc

FuxiSort. Jiamang Wang, Yongjun Wu, Hua Cai, Zhipeng Tang, Zhiqiang Lv, Bin Lu, Yangyu Tao, Chao Li, Jingren Zhou, Hong Tang Alibaba Group Inc Fuxi Jiamang Wang, Yongjun Wu, Hua Cai, Zhipeng Tang, Zhiqiang Lv, Bin Lu, Yangyu Tao, Chao Li, Jingren Zhou, Hong Tang Alibaba Group Inc {jiamang.wang, yongjun.wyj, hua.caihua, zhipeng.tzp, zhiqiang.lv,

More information

CHAPTER 6 ENERGY AWARE SCHEDULING ALGORITHMS IN CLOUD ENVIRONMENT

CHAPTER 6 ENERGY AWARE SCHEDULING ALGORITHMS IN CLOUD ENVIRONMENT CHAPTER 6 ENERGY AWARE SCHEDULING ALGORITHMS IN CLOUD ENVIRONMENT This chapter discusses software based scheduling and testing. DVFS (Dynamic Voltage and Frequency Scaling) [42] based experiments have

More information

Advanced Databases: Parallel Databases A.Poulovassilis

Advanced Databases: Parallel Databases A.Poulovassilis 1 Advanced Databases: Parallel Databases A.Poulovassilis 1 Parallel Database Architectures Parallel database systems use parallel processing techniques to achieve faster DBMS performance and handle larger

More information

MOHA: Many-Task Computing Framework on Hadoop

MOHA: Many-Task Computing Framework on Hadoop Apache: Big Data North America 2017 @ Miami MOHA: Many-Task Computing Framework on Hadoop Soonwook Hwang Korea Institute of Science and Technology Information May 18, 2017 Table of Contents Introduction

More information

Write a technical report Present your results Write a workshop/conference paper (optional) Could be a real system, simulation and/or theoretical

Write a technical report Present your results Write a workshop/conference paper (optional) Could be a real system, simulation and/or theoretical Identify a problem Review approaches to the problem Propose a novel approach to the problem Define, design, prototype an implementation to evaluate your approach Could be a real system, simulation and/or

More information

FedX: A Federation Layer for Distributed Query Processing on Linked Open Data

FedX: A Federation Layer for Distributed Query Processing on Linked Open Data FedX: A Federation Layer for Distributed Query Processing on Linked Open Data Andreas Schwarte 1, Peter Haase 1,KatjaHose 2, Ralf Schenkel 2, and Michael Schmidt 1 1 fluid Operations AG, Walldorf, Germany

More information

Cloud Computing CS

Cloud Computing CS Cloud Computing CS 15-319 Programming Models- Part III Lecture 6, Feb 1, 2012 Majd F. Sakr and Mohammad Hammoud 1 Today Last session Programming Models- Part II Today s session Programming Models Part

More information

Fusion iomemory PCIe Solutions from SanDisk and Sqrll make Accumulo Hypersonic

Fusion iomemory PCIe Solutions from SanDisk and Sqrll make Accumulo Hypersonic WHITE PAPER Fusion iomemory PCIe Solutions from SanDisk and Sqrll make Accumulo Hypersonic Western Digital Technologies, Inc. 951 SanDisk Drive, Milpitas, CA 95035 www.sandisk.com Table of Contents Executive

More information

SEDA: An Architecture for Well-Conditioned, Scalable Internet Services

SEDA: An Architecture for Well-Conditioned, Scalable Internet Services SEDA: An Architecture for Well-Conditioned, Scalable Internet Services Matt Welsh, David Culler, and Eric Brewer Computer Science Division University of California, Berkeley Operating Systems Principles

More information

Schema-Agnostic Indexing with Azure Document DB

Schema-Agnostic Indexing with Azure Document DB Schema-Agnostic Indexing with Azure Document DB Introduction Azure DocumentDB is Microsoft s multi-tenant distributed database service for managing JSON documents at Internet scale Multi-tenancy is an

More information

Big Data Infrastructures & Technologies

Big Data Infrastructures & Technologies Big Data Infrastructures & Technologies Data streams and low latency processing DATA STREAM BASICS What is a data stream? Large data volume, likely structured, arriving at a very high rate Potentially

More information

FROM LEGACY, TO BATCH, TO NEAR REAL-TIME. Marc Sturlese, Dani Solà

FROM LEGACY, TO BATCH, TO NEAR REAL-TIME. Marc Sturlese, Dani Solà FROM LEGACY, TO BATCH, TO NEAR REAL-TIME Marc Sturlese, Dani Solà WHO ARE WE? Marc Sturlese - @sturlese Backend engineer, focused on R&D Interests: search, scalability Dani Solà - @dani_sola Backend engineer

More information

Storage Networking Strategy for the Next Five Years

Storage Networking Strategy for the Next Five Years White Paper Storage Networking Strategy for the Next Five Years 2018 Cisco and/or its affiliates. All rights reserved. This document is Cisco Public Information. Page 1 of 8 Top considerations for storage

More information

Solace JMS Broker Delivers Highest Throughput for Persistent and Non-Persistent Delivery

Solace JMS Broker Delivers Highest Throughput for Persistent and Non-Persistent Delivery Solace JMS Broker Delivers Highest Throughput for Persistent and Non-Persistent Delivery Java Message Service (JMS) is a standardized messaging interface that has become a pervasive part of the IT landscape

More information

OpenCache. A Platform for Efficient Video Delivery. Matthew Broadbent. 1 st Year PhD Student

OpenCache. A Platform for Efficient Video Delivery. Matthew Broadbent. 1 st Year PhD Student OpenCache A Platform for Efficient Video Delivery Matthew Broadbent 1 st Year PhD Student Motivation Consumption of video content on the Internet is constantly expanding Video-on-demand is an ever greater

More information

Optimizing Parallel Access to the BaBar Database System Using CORBA Servers

Optimizing Parallel Access to the BaBar Database System Using CORBA Servers SLAC-PUB-9176 September 2001 Optimizing Parallel Access to the BaBar Database System Using CORBA Servers Jacek Becla 1, Igor Gaponenko 2 1 Stanford Linear Accelerator Center Stanford University, Stanford,

More information

Best Practices for Setting BIOS Parameters for Performance

Best Practices for Setting BIOS Parameters for Performance White Paper Best Practices for Setting BIOS Parameters for Performance Cisco UCS E5-based M3 Servers May 2013 2014 Cisco and/or its affiliates. All rights reserved. This document is Cisco Public. Page

More information

High Throughput WAN Data Transfer with Hadoop-based Storage

High Throughput WAN Data Transfer with Hadoop-based Storage High Throughput WAN Data Transfer with Hadoop-based Storage A Amin 2, B Bockelman 4, J Letts 1, T Levshina 3, T Martin 1, H Pi 1, I Sfiligoi 1, M Thomas 2, F Wuerthwein 1 1 University of California, San

More information

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros Data Clustering on the Parallel Hadoop MapReduce Model Dimitrios Verraros Overview The purpose of this thesis is to implement and benchmark the performance of a parallel K- means clustering algorithm on

More information

D6.1 Integration Report

D6.1 Integration Report Project Number 611411 D6.1 Integration Report Version 1.0 Final Public Distribution Bosch with contributions from all partners Project Partners: aicas, Bosch, CNRS, Rheon Media, The Open Group, University

More information

Mark Sandstrom ThroughPuter, Inc.

Mark Sandstrom ThroughPuter, Inc. Hardware Implemented Scheduler, Placer, Inter-Task Communications and IO System Functions for Many Processors Dynamically Shared among Multiple Applications Mark Sandstrom ThroughPuter, Inc mark@throughputercom

More information

Grid Computing Systems: A Survey and Taxonomy

Grid Computing Systems: A Survey and Taxonomy Grid Computing Systems: A Survey and Taxonomy Material for this lecture from: A Survey and Taxonomy of Resource Management Systems for Grid Computing Systems, K. Krauter, R. Buyya, M. Maheswaran, CS Technical

More information

Priority Based Resource Scheduling Techniques for a Multitenant Stream Processing Platform

Priority Based Resource Scheduling Techniques for a Multitenant Stream Processing Platform Priority Based Resource Scheduling Techniques for a Multitenant Stream Processing Platform By Rudraneel Chakraborty A thesis submitted to the Faculty of Graduate and Postdoctoral Affairs in partial fulfillment

More information

8/24/2017 Week 1-B Instructor: Sangmi Lee Pallickara

8/24/2017 Week 1-B Instructor: Sangmi Lee Pallickara Week 1-B-0 Week 1-B-1 CS535 BIG DATA FAQs Slides are available on the course web Wait list Term project topics PART 0. INTRODUCTION 2. DATA PROCESSING PARADIGMS FOR BIG DATA Sangmi Lee Pallickara Computer

More information

Data Partitioning and MapReduce

Data Partitioning and MapReduce Data Partitioning and MapReduce Krzysztof Dembczyński Intelligent Decision Support Systems Laboratory (IDSS) Poznań University of Technology, Poland Intelligent Decision Support Systems Master studies,

More information

Improving the MapReduce Big Data Processing Framework

Improving the MapReduce Big Data Processing Framework Improving the MapReduce Big Data Processing Framework Gistau, Reza Akbarinia, Patrick Valduriez INRIA & LIRMM, Montpellier, France In collaboration with Divyakant Agrawal, UCSB Esther Pacitti, UM2, LIRMM

More information

Big Data Using Hadoop

Big Data Using Hadoop IEEE 2016-17 PROJECT LIST(JAVA) Big Data Using Hadoop 17ANSP-BD-001 17ANSP-BD-002 Hadoop Performance Modeling for JobEstimation and Resource Provisioning MapReduce has become a major computing model for

More information

RAMCloud. Scalable High-Performance Storage Entirely in DRAM. by John Ousterhout et al. Stanford University. presented by Slavik Derevyanko

RAMCloud. Scalable High-Performance Storage Entirely in DRAM. by John Ousterhout et al. Stanford University. presented by Slavik Derevyanko RAMCloud Scalable High-Performance Storage Entirely in DRAM 2009 by John Ousterhout et al. Stanford University presented by Slavik Derevyanko Outline RAMCloud project overview Motivation for RAMCloud storage:

More information

Welcome to the New Era of Cloud Computing

Welcome to the New Era of Cloud Computing Welcome to the New Era of Cloud Computing Aaron Kimball The web is replacing the desktop 1 SDKs & toolkits are there What about the backend? Image: Wikipedia user Calyponte 2 Two key concepts Processing

More information

Data Model Considerations for Radar Systems

Data Model Considerations for Radar Systems WHITEPAPER Data Model Considerations for Radar Systems Executive Summary The market demands that today s radar systems be designed to keep up with a rapidly changing threat environment, adapt to new technologies,

More information

1 Publishable Summary

1 Publishable Summary 1 Publishable Summary 1.1 VELOX Motivation and Goals The current trend in designing processors with multiple cores, where cores operate in parallel and each of them supports multiple threads, makes the

More information

B.H.GARDI COLLEGE OF ENGINEERING & TECHNOLOGY (MCA Dept.) Parallel Database Database Management System - 2

B.H.GARDI COLLEGE OF ENGINEERING & TECHNOLOGY (MCA Dept.) Parallel Database Database Management System - 2 Introduction :- Today single CPU based architecture is not capable enough for the modern database that are required to handle more demanding and complex requirements of the users, for example, high performance,

More information

Interactive Analysis of Large Distributed Systems with Scalable Topology-based Visualization

Interactive Analysis of Large Distributed Systems with Scalable Topology-based Visualization Interactive Analysis of Large Distributed Systems with Scalable Topology-based Visualization Lucas M. Schnorr, Arnaud Legrand, and Jean-Marc Vincent e-mail : Firstname.Lastname@imag.fr Laboratoire d Informatique

More information

New research on Key Technologies of unstructured data cloud storage

New research on Key Technologies of unstructured data cloud storage 2017 International Conference on Computing, Communications and Automation(I3CA 2017) New research on Key Technologies of unstructured data cloud storage Songqi Peng, Rengkui Liua, *, Futian Wang State

More information

Tag Switching. Background. Tag-Switching Architecture. Forwarding Component CHAPTER

Tag Switching. Background. Tag-Switching Architecture. Forwarding Component CHAPTER CHAPTER 23 Tag Switching Background Rapid changes in the type (and quantity) of traffic handled by the Internet and the explosion in the number of Internet users is putting an unprecedented strain on the

More information

Hardware Assisted Recursive Packet Classification Module for IPv6 etworks ABSTRACT

Hardware Assisted Recursive Packet Classification Module for IPv6 etworks ABSTRACT Hardware Assisted Recursive Packet Classification Module for IPv6 etworks Shivvasangari Subramani [shivva1@umbc.edu] Department of Computer Science and Electrical Engineering University of Maryland Baltimore

More information

Real-time Scheduling of Skewed MapReduce Jobs in Heterogeneous Environments

Real-time Scheduling of Skewed MapReduce Jobs in Heterogeneous Environments Real-time Scheduling of Skewed MapReduce Jobs in Heterogeneous Environments Nikos Zacheilas, Vana Kalogeraki Department of Informatics Athens University of Economics and Business 1 Big Data era has arrived!

More information

A RESOURCE AWARE SOFTWARE ARCHITECTURE FEATURING DEVICE SYNCHRONIZATION AND FAULT TOLERANCE

A RESOURCE AWARE SOFTWARE ARCHITECTURE FEATURING DEVICE SYNCHRONIZATION AND FAULT TOLERANCE A RESOURCE AWARE SOFTWARE ARCHITECTURE FEATURING DEVICE SYNCHRONIZATION AND FAULT TOLERANCE Chris Mattmann University of Southern California University Park Campus, Los Angeles, CA 90007 mattmann@usc.edu

More information

IBM InfoSphere Streams v4.0 Performance Best Practices

IBM InfoSphere Streams v4.0 Performance Best Practices Henry May IBM InfoSphere Streams v4.0 Performance Best Practices Abstract Streams v4.0 introduces powerful high availability features. Leveraging these requires careful consideration of performance related

More information

TUTORIAL: WHITE PAPER. VERITAS Indepth for the J2EE Platform PERFORMANCE MANAGEMENT FOR J2EE APPLICATIONS

TUTORIAL: WHITE PAPER. VERITAS Indepth for the J2EE Platform PERFORMANCE MANAGEMENT FOR J2EE APPLICATIONS TUTORIAL: WHITE PAPER VERITAS Indepth for the J2EE Platform PERFORMANCE MANAGEMENT FOR J2EE APPLICATIONS 1 1. Introduction The Critical Mid-Tier... 3 2. Performance Challenges of J2EE Applications... 3

More information

Analysis of Sorting as a Streaming Application

Analysis of Sorting as a Streaming Application 1 of 10 Analysis of Sorting as a Streaming Application Greg Galloway, ggalloway@wustl.edu (A class project report written under the guidance of Prof. Raj Jain) Download Abstract Expressing concurrency

More information

Nowadays data-intensive applications play a

Nowadays data-intensive applications play a Journal of Advances in Computer Engineering and Technology, 3(2) 2017 Data Replication-Based Scheduling in Cloud Computing Environment Bahareh Rahmati 1, Amir Masoud Rahmani 2 Received (2016-02-02) Accepted

More information

Putting it together. Data-Parallel Computation. Ex: Word count using partial aggregation. Big Data Processing. COS 418: Distributed Systems Lecture 21

Putting it together. Data-Parallel Computation. Ex: Word count using partial aggregation. Big Data Processing. COS 418: Distributed Systems Lecture 21 Big Processing -Parallel Computation COS 418: Distributed Systems Lecture 21 Michael Freedman 2 Ex: Word count using partial aggregation Putting it together 1. Compute word counts from individual files

More information

Distributed Data Infrastructures, Fall 2017, Chapter 2. Jussi Kangasharju

Distributed Data Infrastructures, Fall 2017, Chapter 2. Jussi Kangasharju Distributed Data Infrastructures, Fall 2017, Chapter 2 Jussi Kangasharju Chapter Outline Warehouse-scale computing overview Workloads and software infrastructure Failures and repairs Note: Term Warehouse-scale

More information

Lecture 11 Hadoop & Spark

Lecture 11 Hadoop & Spark Lecture 11 Hadoop & Spark Dr. Wilson Rivera ICOM 6025: High Performance Computing Electrical and Computer Engineering Department University of Puerto Rico Outline Distributed File Systems Hadoop Ecosystem

More information

HANA Performance. Efficient Speed and Scale-out for Real-time BI

HANA Performance. Efficient Speed and Scale-out for Real-time BI HANA Performance Efficient Speed and Scale-out for Real-time BI 1 HANA Performance: Efficient Speed and Scale-out for Real-time BI Introduction SAP HANA enables organizations to optimize their business

More information

Storm. Distributed and fault-tolerant realtime computation. Nathan Marz Twitter

Storm. Distributed and fault-tolerant realtime computation. Nathan Marz Twitter Storm Distributed and fault-tolerant realtime computation Nathan Marz Twitter Storm at Twitter Twitter Web Analytics Before Storm Queues Workers Example (simplified) Example Workers schemify tweets and

More information

A BigData Tour HDFS, Ceph and MapReduce

A BigData Tour HDFS, Ceph and MapReduce A BigData Tour HDFS, Ceph and MapReduce These slides are possible thanks to these sources Jonathan Drusi - SCInet Toronto Hadoop Tutorial, Amir Payberah - Course in Data Intensive Computing SICS; Yahoo!

More information

Introduction to Hadoop. Owen O Malley Yahoo!, Grid Team

Introduction to Hadoop. Owen O Malley Yahoo!, Grid Team Introduction to Hadoop Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com Who Am I? Yahoo! Architect on Hadoop Map/Reduce Design, review, and implement features in Hadoop Working on Hadoop full time since

More information

Optimizing Apache Spark with Memory1. July Page 1 of 14

Optimizing Apache Spark with Memory1. July Page 1 of 14 Optimizing Apache Spark with Memory1 July 2016 Page 1 of 14 Abstract The prevalence of Big Data is driving increasing demand for real -time analysis and insight. Big data processing platforms, like Apache

More information

OVERHEADS ENHANCEMENT IN MUTIPLE PROCESSING SYSTEMS BY ANURAG REDDY GANKAT KARTHIK REDDY AKKATI

OVERHEADS ENHANCEMENT IN MUTIPLE PROCESSING SYSTEMS BY ANURAG REDDY GANKAT KARTHIK REDDY AKKATI CMPE 655- MULTIPLE PROCESSOR SYSTEMS OVERHEADS ENHANCEMENT IN MUTIPLE PROCESSING SYSTEMS BY ANURAG REDDY GANKAT KARTHIK REDDY AKKATI What is MULTI PROCESSING?? Multiprocessing is the coordinated processing

More information

HPC Considerations for Scalable Multidiscipline CAE Applications on Conventional Linux Platforms. Author: Correspondence: ABSTRACT:

HPC Considerations for Scalable Multidiscipline CAE Applications on Conventional Linux Platforms. Author: Correspondence: ABSTRACT: HPC Considerations for Scalable Multidiscipline CAE Applications on Conventional Linux Platforms Author: Stan Posey Panasas, Inc. Correspondence: Stan Posey Panasas, Inc. Phone +510 608 4383 Email sposey@panasas.com

More information

Flying Faster with Heron

Flying Faster with Heron Flying Faster with Heron KARTHIK RAMASAMY @KARTHIKZ #TwitterHeron TALK OUTLINE BEGIN I! II ( III b OVERVIEW MOTIVATION HERON IV Z OPERATIONAL EXPERIENCES V K HERON PERFORMANCE END [! OVERVIEW TWITTER IS

More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff and Shun Tak Leung Google* Shivesh Kumar Sharma fl4164@wayne.edu Fall 2015 004395771 Overview Google file system is a scalable distributed file system

More information

Big Data Programming: an Introduction. Spring 2015, X. Zhang Fordham Univ.

Big Data Programming: an Introduction. Spring 2015, X. Zhang Fordham Univ. Big Data Programming: an Introduction Spring 2015, X. Zhang Fordham Univ. Outline What the course is about? scope Introduction to big data programming Opportunity and challenge of big data Origin of Hadoop

More information

MI-PDB, MIE-PDB: Advanced Database Systems

MI-PDB, MIE-PDB: Advanced Database Systems MI-PDB, MIE-PDB: Advanced Database Systems http://www.ksi.mff.cuni.cz/~svoboda/courses/2015-2-mie-pdb/ Lecture 10: MapReduce, Hadoop 26. 4. 2016 Lecturer: Martin Svoboda svoboda@ksi.mff.cuni.cz Author:

More information

Comprehensive Guide to Evaluating Event Stream Processing Engines

Comprehensive Guide to Evaluating Event Stream Processing Engines Comprehensive Guide to Evaluating Event Stream Processing Engines i Copyright 2006 Coral8, Inc. All rights reserved worldwide. Worldwide Headquarters: Coral8, Inc. 82 Pioneer Way, Suite 106 Mountain View,

More information

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce Parallel Programming Principle and Practice Lecture 10 Big Data Processing with MapReduce Outline MapReduce Programming Model MapReduce Examples Hadoop 2 Incredible Things That Happen Every Minute On The

More information

Databases 2 (VU) ( / )

Databases 2 (VU) ( / ) Databases 2 (VU) (706.711 / 707.030) MapReduce (Part 3) Mark Kröll ISDS, TU Graz Nov. 27, 2017 Mark Kröll (ISDS, TU Graz) MapReduce Nov. 27, 2017 1 / 42 Outline 1 Problems Suited for Map-Reduce 2 MapReduce:

More information

Catalogic DPX TM 4.3. ECX 2.0 Best Practices for Deployment and Cataloging

Catalogic DPX TM 4.3. ECX 2.0 Best Practices for Deployment and Cataloging Catalogic DPX TM 4.3 ECX 2.0 Best Practices for Deployment and Cataloging 1 Catalogic Software, Inc TM, 2015. All rights reserved. This publication contains proprietary and confidential material, and is

More information

Hadoop Virtualization Extensions on VMware vsphere 5 T E C H N I C A L W H I T E P A P E R

Hadoop Virtualization Extensions on VMware vsphere 5 T E C H N I C A L W H I T E P A P E R Hadoop Virtualization Extensions on VMware vsphere 5 T E C H N I C A L W H I T E P A P E R Table of Contents Introduction... 3 Topology Awareness in Hadoop... 3 Virtual Hadoop... 4 HVE Solution... 5 Architecture...

More information

An Introduction to Big Data Formats

An Introduction to Big Data Formats Introduction to Big Data Formats 1 An Introduction to Big Data Formats Understanding Avro, Parquet, and ORC WHITE PAPER Introduction to Big Data Formats 2 TABLE OF TABLE OF CONTENTS CONTENTS INTRODUCTION

More information

02 - Distributed Systems

02 - Distributed Systems 02 - Distributed Systems Definition Coulouris 1 (Dis)advantages Coulouris 2 Challenges Saltzer_84.pdf Models Physical Architectural Fundamental 2/60 Definition Distributed Systems Distributed System is

More information

The Design and Implementation of a Next Generation Name Service for the Internet (CoDoNS) Presented By: Kamalakar Kambhatla

The Design and Implementation of a Next Generation Name Service for the Internet (CoDoNS) Presented By: Kamalakar Kambhatla The Design and Implementation of a Next Generation Name Service for the Internet (CoDoNS) Venugopalan Ramasubramanian Emin Gün Sirer Presented By: Kamalakar Kambhatla * Slides adapted from the paper -

More information

Towards Modeling Approach Enabling Efficient Platform for Heterogeneous Big Data Analysis.

Towards Modeling Approach Enabling Efficient Platform for Heterogeneous Big Data Analysis. Towards Modeling Approach Enabling Efficient Platform for Heterogeneous Big Data Analysis Andrey.Sadovykh@softeam.fr www.modeliosoft.com 1 Outlines Introduction Model-driven development Big Data Juniper

More information

ADAPTIVE HANDLING OF 3V S OF BIG DATA TO IMPROVE EFFICIENCY USING HETEROGENEOUS CLUSTERS

ADAPTIVE HANDLING OF 3V S OF BIG DATA TO IMPROVE EFFICIENCY USING HETEROGENEOUS CLUSTERS INTERNATIONAL JOURNAL OF RESEARCH IN COMPUTER APPLICATIONS AND ROBOTICS ISSN 2320-7345 ADAPTIVE HANDLING OF 3V S OF BIG DATA TO IMPROVE EFFICIENCY USING HETEROGENEOUS CLUSTERS Radhakrishnan R 1, Karthik

More information