Scheduling Data- and Compute-intensive Applications in Hierarchical Distributed Systems


Matthias Röhm, Matthias Grabert and Franz Schweiggert
Institute of Applied Information Processing, Ulm University, Ulm, Germany

Abstract: The growing computerization in modern academic and industrial sectors is generating huge volumes of electronic data. Hierarchical distributed systems based on Grid and Cloud technologies promise to meet the tremendously rising resource requirements of heterogeneous, large-scale and distributed data mining applications. Scheduling plays a pivotal role in such environments. While most schedulers addressing these new challenges have a strong focus on compute-intensive applications, we introduce a new scheduling algorithm that supports both compute- and data-intensive applications in dynamic, heterogeneous, hierarchical environments. The developed data-aware scheduling algorithm aims to minimize the completion times of the applications as well as their costs, leading to an efficient utilization of all available resources. The algorithm is specifically designed for combined storage and compute resources, as these allow jobs to be executed on the resources storing the data sets and thus are the key to avoiding time-consuming and expensive data transfers. Simulations and first real-world usage experiences in the Fleet Data Acquisition Miner for analyzing the data generated by the Daimler fuel cell vehicle fleet show that the algorithm is suited for the different aspects of today's data analysis challenges.

Keywords: data-intensive; scheduling; Cloud; Grid.

I. INTRODUCTION

Increasing data volumes in many industrial and academic sectors are fueling the need for novel data analysis solutions to extract valuable information. Data mining, as the key methodology to address these information needs, requires effective and efficient resource management to transform the growing data into knowledge. There have been many efforts to provide specialized resource management solutions for complex data mining scenarios, including peer-to-peer data mining, distributed data stream mining and parallel data mining [1][2].

Recently, data mining research and development has put a focus on highly data-intensive applications. Google's publications on MapReduce [3][4] inspired many projects working on large data sets. MapReduce frameworks, like Hadoop [5], simplify the development and deployment of peta-scale data mining applications leveraging thousands of machines. MapReduce frameworks are highly scalable because the scheduler uses data location information to avoid data movement and instead sends the algorithms to the data.

Other current distributed data mining research is motivated by the sharing of heterogeneous, geographically distributed, dynamic resources from multiple administrative domains to support the cooperation of different organizations [6][7][8]. This field is generally referred to as data mining in Grid computing environments and is closely related to the Cloud computing paradigm. Most of this research has focused on compute-intensive applications, following scheduling principles that are correct for compute-intensive, but not for data-intensive data mining applications.
Different scheduling algorithms have been proposed to optimize the relation between data transfer and execution time [9][10][11]. For data-intensive applications, however, where the limiting factor is not CPU power but rather storage and network speed, the underlying architecture and environment assumptions of these algorithms may lead to non-optimal schedules. In addition, with the almost unlimited resources available through Cloud providers, the traditional scheduling concept in which jobs have to be assigned to a limited set of resources is no longer valid. Instead, the scheduler has to assign jobs to the resource(s) that best fit the needs of the job while considering the cost to execution time ratio.

As current data mining applications are both compute- and data-intensive, we developed an architecture based on the notion of combined compute and storage resources to bring the advantages of the MapReduce paradigm into worldwide, heterogeneous, general-purpose computing environments [12]. In this article we present a multi-objective scheduling algorithm for this dynamic, hierarchical Grid architecture incorporating the Cloud computing resource concepts.

This article is organized as follows: First, we briefly introduce the generalized architecture and its implications for data- and compute-intensive scheduling. Then we describe the developed multi-objective scheduling algorithm and compare it with existing Grid scheduling algorithms. Finally, we present the simulation results of the developed scheduling algorithm.

II. A GENERALIZED GRID SCHEDULING ARCHITECTURE

Grid scheduling algorithms are responsible for mapping application resource requests to available resources.

In comparison with other scheduling problems, the scheduler has to make its decisions under the following environmental characteristics:

- An application is represented by one or more jobs which might have dependencies. Each job consists of an executable with parameters, a set of input and output data specifications, and requirements.
- The execution time for a given job is not known in general.
- A job may require one or more compute resources.
- A data set may be stored on multiple storage resources.
- The Grid environment is dynamic and resources are heterogeneous.
- There is no reliable source of information.
- Jobs arrive at an unpredictable rate.

The main goal of the Grid scheduler is to produce a schedule for all jobs arriving over time that minimizes a given objective function (makespan, average completion time or cost) under these constraints. As this scheduling problem is related to problems that are known to be NP-complete, there is little chance that a polynomial algorithm exists to solve it [13]. Therefore Grid scheduling algorithms use the structure of the Grid environment to implement heuristics or approximations.

For compute-intensive application scenarios the main resource and the limiting factor is CPU power, and the focus of the Grid scheduler is to efficiently use the compute power of multiple compute clusters. In these scenarios it is commonly assumed that the time needed to transfer the input and output data is relatively small compared to the overall execution time. These assumptions lead to the following architecture, which forms the basis of current Grid schedulers:

(1) Specialized storage servers store input and output data as well as executables.
(2) A set of compute clusters from different organizations, each composed of multiple compute nodes, runs the algorithms. To provide a high level of transparency, these clusters are treated as one multi-CPU resource.
(3) In a traditional Grid scheduling setup, multiple clusters from different organizations are connected through relatively slow wide area networks, whereas the network bandwidth within an organization is assumed to be infinite.

A setup like this fits the needs of compute-intensive applications: To schedule a compute-intensive application requesting n computational resources, the scheduler only has to look for a cluster that has the best n free compute resources. As data transfer time is small compared to the execution time, the transfer overhead is sometimes neglected.

For most data-intensive applications this assumption does not hold. On the contrary, not CPU power but storage and network speed are the limiting factors of data-intensive applications. Data-intensive applications therefore require new scheduling strategies, as the input data transfer time may well exceed the execution time. Now the scheduler should choose compute and storage resources so that the overall time or cost, depending on the scheduling objective, is minimized. Obviously, a scheduler assuming infinite bandwidth within an organization may produce non-optimal schedules. Another aspect of traditional Grid schedulers is the assumption that there is only a very limited set of resources to which jobs have to be scheduled. But with the advances in Cloud computing, almost unlimited (compute) resources are available to the scheduler.
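To make the compute-only view described above concrete, the following minimal Python sketch (all function and field names are illustrative, not taken from any existing scheduler) picks the cluster offering the best n free compute nodes and ignores data placement entirely; this is exactly the decision rule that breaks down once data transfers dominate:

    # Illustrative sketch of a traditional compute-only scheduling decision
    # (hypothetical names; assumes data transfer overhead is negligible).
    def schedule_compute_intensive(clusters, n):
        """Pick the cluster whose n fastest free nodes are best, ignoring data location."""
        best_cluster, best_speed = None, -1.0
        for cluster in clusters:
            free_nodes = [node for node in cluster["nodes"] if node["free"]]
            if len(free_nodes) < n:
                continue  # this cluster cannot host the job at all
            # rank the free nodes by CPU speed and look at the n best ones
            top = sorted(free_nodes, key=lambda node: node["cpu_speed"], reverse=True)[:n]
            speed = sum(node["cpu_speed"] for node in top)
            if speed > best_speed:
                best_cluster, best_speed = cluster, speed
        return best_cluster

Nothing in this decision depends on where the input data resides, which is why such schedulers perform poorly for data-intensive jobs.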
We identified the need for some major conceptual enhancements to traditional Grid scheduling architectures to efficiently support compute- and data-intensive applications considering Cloud resources:

(1) As the amount of data increases, data cannot be efficiently stored and processed from a single storage server within a cluster but has to be distributed over multiple machines. Therefore any resource may provide storage and compute capacity. This combined resource type forms the basis for scalable data-intensive applications, as data can be processed directly at the storage location. To increase storage capacity and speed, computational resources of compute clusters may become combined resources by storing data on their local disks. It is important to point out that combined resources remain suitable for data- and compute-intensive applications, as only new functionality is added.

(2) The environment is hierarchical. Resources are mainly organized within a cluster; an organization may have multiple clusters on-site or in the Cloud; and multiple organizations may want to share their resources. This not only implies an administrative hierarchy, but the network connecting the different entities also has a hierarchical structure. Resources within a cluster are generally connected through a high-bandwidth, low-latency interconnect (e.g., InfiniBand), whereas inter-cluster or inter-organization network speed is typically much slower.

(3) Data-intensive applications are limited by two factors: storage and network speed. Scheduling algorithms should use these resources efficiently and avoid unnecessary input data transfers by processing the data directly on the resources storing the data or on resources nearby.

(4) There exist internal and external resources. There is only a limited number of internal resources, while any number of external resources may be added at an additional cost.

III. A MULTI-OBJECTIVE SCHEDULING ALGORITHM

A basic algorithm trying to minimize the average runtime of all jobs (J) for combined compute and storage resources in Grids was presented in [14]. With the increasing availability of Cloud resources, this basic scheduling has to be adapted to include the cost of the Cloud resources as well as the cost associated with the wide area networks needed to integrate these resources. Also, internal resources might be assigned costs to encourage a more efficient usage.

In addition to low job execution costs (K_j), the users are interested in getting their results as soon as possible, so that the completion time of the jobs (C_j) is another dimension of the scheduling problem. To balance the cost and the completion time of all jobs, the scheduler should minimize the following objective function:

    F(J) = Σ_{j ∈ J} C_j · K_j        (1)

As not all jobs are known to the scheduler in advance (offline scheduling) but rather appear over time (online scheduling), the minimum of F(J) can only be computed retrospectively. Therefore the scheduler can only schedule a subset of the jobs at a time and can only approximate the minimum of F(J).

Data-intensive jobs are defined as a tuple j = (p, D), where D = {d_1, ..., d_m} are the data sets to be processed by program p. The task of the scheduler is to select a tuple (r, S) for each job minimizing the objective function, where r is a compute resource and S = (s_1, ..., s_m) is the set of storage resources providing the m data subsets. It is assumed that all data subsets d ∈ D have to be available on the execution resource before they can be processed by p.

The algorithm presented in this paper uses the hierarchical structure of the environment and a cost to completion time ratio to decide which set of resources should be used for a job. This dual-objective hierarchical scheduling algorithm (DOHS) is depicted in Figure 1. As described above, the algorithm has to produce a schedule with little information about the current state of the system and the jobs. Therefore the algorithm was designed to require only relative storage and compute speeds of the resources as well as the corresponding relative costs. The inputs of the scheduling algorithm are the set of input data D, the program p, the data transfer scheduling weights α_1 to α_4 and the data to compute ratio weights β_1, β_2.

As all data subsets d ∈ D have to be available on one execution resource, the scheduler has to select a tuple (r, S) for each job, where r is a compute resource and S is the set of storage resources providing the data subsets. First, the algorithm produces the set of candidate execution resources as the resources with the best compute power to cost ratio (minimal f_c) from each cluster in the Grid plus all resources storing at least one of the data sets d ∈ D. For each of these candidate resources, the set of storage resources with minimal aggregated transfer overhead f_s with regard to D is generated. From all candidates the scheduler chooses the one with the highest priority. The priority of a resource assignment (r, S, t, c) is computed as the weighted sum of the normalized transfer overhead (t) and compute overhead (c).

The algorithm and the functions are based on the following definitions:

P := all programs available in the Grid;
R := {r_1, ..., r_n} is the set of all n resources;
D := {d_1, ..., d_m} is the set of m data sets of the job;
N := {(r, s) | r, s ∈ R, r and s can exchange data directly};
R_d := {r | r ∈ R stores d ∈ D};
D_r := {d | d ∈ D is stored on r ∈ R};
c_r is the cluster of resource r and g_r is the grid site of resource r;
sp_r is the storage speed of resource r and sc_r is its corresponding cost;
cp_r is the compute power and cc_r is the compute cost of resource r with respect to program p, where cp_r = 0 and cc_r = ∞ if r does not fulfill all requirements of p.

Different properties of a resource may be used to define the computing and storage power and cost of a resource, but at least the current usage has to be taken into account.
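For illustration, the definitions above might be captured in code roughly as follows. This is a minimal Python sketch with hypothetical names (Resource, Job, objective), not the data model of the actual implementation:

    from dataclasses import dataclass, field

    @dataclass
    class Resource:
        name: str
        cluster: str          # c_r: cluster of the resource
        grid_site: str        # g_r: grid site (organization) of the resource
        storage_speed: float  # sp_r
        storage_cost: float   # sc_r
        num_cpus: int         # nc_r
        reserved_cpus: int    # u_r: currently reserved CPUs
        cpu_speed: float      # cpu_r
        compute_cost: float   # cc_r
        data_sets: set = field(default_factory=set)  # data sets stored locally

    @dataclass
    class Job:
        program: str          # p
        data_sets: list       # D = {d_1, ..., d_m}

    def objective(completion_times, costs):
        """F(J) = sum over all jobs of C_j * K_j, cf. Eq. (1)."""
        return sum(c * k for c, k in zip(completion_times, costs))

A complete model would additionally track the set N of direct network links between resources; the sketch omits it for brevity.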
The data transfer overhead of a candidate resource tuple (r, S) is computed as the sum of the transfer overheads f_s(r, s, d) over all data sets d ∈ D. Due to the incomplete environment information, especially missing or imprecise network information, the scheduler uses the weights α_1 to α_4 that represent the hierarchical structure. The data transfer overhead function f_s assigns the weights α_1 to α_4 to the base transfer overhead of s, derived from the size of d and the storage speed and cost of s, according to the distance between s and r: α_1 if d is stored on the resource itself (r = s); α_2 if d is stored on a resource in the same cluster (c_r = c_s); α_3 if d is on the same Grid instance (g_r = g_s); and α_4 if d is on another Grid instance (g_r ≠ g_s). In case both resources are not able to exchange data directly, each resource needed to transfer the data set d from s to r is also considered.

FUNCTION f_s (r ∈ R, s ∈ R, d ∈ D+)
    Choose the shortest path s_0, s_1, ..., s_k from s to r with s_0 = s so that (s_0, s_1), ..., (s_k, r) ∈ N
    t ← size(d) · Σ_{i=0}^{k} sc_{s_i} / sp_{s_i}
    if r = s then t ← α_1 · t
    else if c_r = c_s then t ← α_2 · t
    else if g_r = g_s then t ← α_3 · t
    else t ← α_4 · t
    end if
    return t
END FUNCTION

As can easily be seen, the data transfer scheduling weights α_1 to α_4 may be chosen to approximate the actual network bandwidth topology and cost, or can be used to minimize inter-cluster or inter-organization transfers.

FUNCTION f_c (p ∈ P, r ∈ R)
    if r does not fulfill all requirements of p then return ∞ end if
    nc_r ← number of CPUs of r
    u_r ← currently reserved CPUs of r
    cpu_r ← CPU speed of r
    cpu_min ← min{ cpu_o | o ∈ R, nc_o > u_o }
    if nc_r > u_r then
        cp_r ← cpu_r
    else if nc_r ≤ u_r and ∃ o ∈ R with nc_o > u_o then
        cp_r ← (nc_r · cpu_min) / (nc_r + u_r)
    else
        cp_r ← (nc_r · cpu_r) / (nc_r + u_r)
    end if
    return cc_r / cp_r
END FUNCTION
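To make the two overhead functions more concrete, the following Python sketch implements them under simplifying assumptions: all resources can exchange data directly (so the path in f_s consists of s alone), and resource objects expose the attributes from the data-model sketch above. It illustrates the weighting scheme and is not the authors' implementation; the function and parameter names are ours.

    ALPHA = {"local": 1.0, "cluster": 1.5, "site": 5.0, "remote": 50.0}  # example weights

    def f_s(r, s, d_size, alpha=ALPHA):
        """Transfer overhead for fetching one data set of size d_size from s to r.

        Assumes r and s can exchange data directly; a full implementation would
        sum sc/sp over every hop of the shortest path from s to r.
        """
        base = d_size * (s.storage_cost / s.storage_speed)
        if r is s:
            return alpha["local"] * base
        if r.cluster == s.cluster:
            return alpha["cluster"] * base
        if r.grid_site == s.grid_site:
            return alpha["site"] * base
        return alpha["remote"] * base

    def f_c(p_requirements, r, all_resources):
        """Compute cost to compute power ratio of r; float('inf') if r cannot run p."""
        if not p_requirements(r):
            return float("inf")
        free_elsewhere = [o.cpu_speed for o in all_resources if o.num_cpus > o.reserved_cpus]
        if r.num_cpus > r.reserved_cpus:          # at least one free CPU on r
            cp = r.cpu_speed
        elif free_elsewhere:                       # r is busy but the Grid still has free CPUs
            cp = (r.num_cpus * min(free_elsewhere)) / (r.num_cpus + r.reserved_cpus)
        else:                                      # the whole Grid is busy
            cp = (r.num_cpus * r.cpu_speed) / (r.num_cpus + r.reserved_cpus)
        return r.compute_cost / cp

With the example α values taken from the simulation setup described later (1, 1.5, 5, 50), fetching a data set from another organization is penalized fifty times more heavily than reading it locally, which is what steers jobs toward the resources already storing their data.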

FUNCTION DOHS (p ∈ P, R, D ⊆ D+, α_1..α_4, β_1, β_2)
    R_m ← { r̂ | r̂ ∈ R, f_c(p, r̂) = min{ f_c(p, r) | r ∈ R, c_r̂ = c_r } }
    R_s ← { r | r ∈ R_D, f_c(p, r) < ∞ }
    Z ← ∅
    for all r ∈ R_m ∪ R_s do
        t_r ← 0, S_r ← ∅
        for all d ∈ D do
            Find ŝ ∈ R_d with f_s(r, ŝ, d) = min{ f_s(r, s, d) | s ∈ R_d }
            t_r ← t_r + f_s(r, ŝ, d)
            S_r ← S_r ∪ {ŝ}
        end for
        Z ← Z ∪ {(r, S_r, t_r, f_c(p, r))}
    end for
    t_min ← min{ t | (r, S, t, c) ∈ Z }
    c_min ← min{ c | (r, S, t, c) ∈ Z }
    Find (r̂, Ŝ, t̂, ĉ) ∈ Z with β_1 · t_min/t̂ + β_2 · c_min/ĉ = max{ β_1 · t_min/t + β_2 · c_min/c | (r, S, t, c) ∈ Z }
    return (r̂, Ŝ, t̂, ĉ)
END FUNCTION

Figure 1. DOHS algorithm for scheduling a program with multiple data sets.

The compute overhead function f_c returns the compute cost to compute power ratio of the resource with respect to program p. If r does not fulfill all requirements of p, the function returns ∞. In case there is at least one free CPU, the compute power is simply the CPU speed of the resource. If all CPUs are used but another resource in the Grid has a free CPU, the compute power is defined as the number of CPUs times the CPU speed of the slowest available resource, divided by the number of CPUs plus the reserved CPUs. Using the speed of the slowest available resource ensures that the compute power of a busy resource is never higher than the compute power of a resource with free CPUs. If there is no free CPU in the Grid, the compute power is defined using the resource's own CPU speed.

IV. SIMULATION RESULTS

To evaluate the presented algorithm we developed a simulation environment for executing data- and compute-intensive jobs in Grids with additional Cloud resources. In the first step of a simulation, a random number of resources (> 100), clusters and organizations are created. Compute, storage and network speeds of each resource are randomly generated as well. Also the number of jobs, the number of computations per MB and the number of data sets per job are generated randomly. In the next step, different scheduling algorithms are used to schedule the jobs as they arrive over time in the created Grid environment. For each algorithm the resulting schedule is simulated and the exact cost and time consumption of each job is computed. Based on the consumed cost and time, F(J) is calculated for each algorithm. Following the Monte Carlo simulation approach, 20 of these simulations were conducted and the Σ_{j ∈ J} C_j · K_j of each was recorded.

As a benchmark we use a brute-force algorithm evaluating all possible resource combinations for each job, minimizing different objective functions. As this requires n · s^m objective function evaluations (with n compute resources, on average m data sets per job, and each data set stored on average on s resources), it is not feasible to use this algorithm for real-world scheduling. But as a benchmark, it ensures that the (locally) optimal resources are chosen according to the objective function. In contrast to the developed algorithm, the benchmark algorithm is also provided with all environment and job information, including exact network bandwidth and the jobs' execution times.
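As a rough illustration of why the brute-force benchmark is infeasible in practice, the sketch below (plain Python, hypothetical names) enumerates every combination of one compute resource and one storage replica per data set, which is exactly the n · s^m candidate assignments mentioned above:

    import itertools

    def brute_force_schedule(job_data_sets, compute_resources, replicas, evaluate):
        """Evaluate every (r, S) assignment and return the one minimizing `evaluate`.

        replicas maps each data set to the storage resources holding a copy of it;
        with n compute resources and s replicas per data set this loop runs
        n * s**len(job_data_sets) times, hence it only serves as a benchmark.
        """
        best_assignment, best_value = None, float("inf")
        storage_choices = [replicas[d] for d in job_data_sets]
        for r in compute_resources:
            for S in itertools.product(*storage_choices):
                value = evaluate(r, S)   # e.g., cost * completion time of the job
                if value < best_value:
                    best_assignment, best_value = (r, S), value
        return best_assignment

The evaluate callback stands in for whichever objective function the benchmark is configured with.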

The benchmark algorithm is used to compute a schedule based on the following objective functions that are commonly used for Grid scheduling:

- Cost-Time (CT): Minimize the product of the job's cost and completion time.
- Time (T): Minimize the completion time of the job.
- Cost (C): Minimize the cost of the job.
- Transfer (TR): Choose the execution resource with the minimum transfer overhead.
- Time-Grid (TG): Minimize the completion time of the job based on the assumption that transfer overhead and cost are zero within an organization.

As shown in Figure 2, the developed DOHS scheduling algorithm provides good performance compared to the benchmark algorithm using the different objective functions. The DOHS algorithm achieves almost the same performance as the benchmark algorithm using the Cost-Time objective function and all information. The Time-Grid objective function may be regarded as a representative of current Grid schedulers that assume that transfer overhead and cost are zero within an organization. It not only requires more execution time but also incurs higher costs than the Time objective function, especially for data-intensive applications. Only the Transfer objective function, which can be seen as a generalization of the MapReduce scheduling approach of minimizing the transfer overhead, is worse.

We evaluated the algorithms for a mix of compute- and data-intensive jobs. Figure 2(a) shows two different values for the DOHS algorithm. DOHS-1 represents a scenario where users classify the jobs as compute-intensive (β_1 = 0.9, β_2 = 1) or data-intensive (β_1 = 1, β_2 = 0.9). DOHS-2 shows the results for a scenario where users do not provide any classification (β_1 = 1, β_2 = 1). As the DOHS algorithm provides parameters to adapt to different resource environments and job characteristics, it was configured with the following parameters: α_1 = 1, α_2 = 1.5, α_3 = 5 and α_4 = 50.

Figure 2. Scheduling simulation results: (a) Σ_{j ∈ J} C_j · K_j for the CT, DOHS-1, DOHS-2, T, C, TG and TR schedules; (b) total cost Σ_{j ∈ J} K_j versus total completion time Σ_{j ∈ J} C_j.

V. RELATED WORK

Recently, various systems and approaches to grid-based data mining and data-intensive scheduling have been reported in the literature. Some of those that are particularly relevant to this work are briefly reviewed here.

The GridBus resource broker [15] provides functions for scheduling data- and compute-intensive applications. In combination with the Storage Resource Broker [16], GridBus is able to schedule data-intensive jobs based on various metrics, including network bandwidth and utilization. The GridBus scheduler, like most heuristic Grid schedulers, including the DIANA scheduler [9], follows the common separation between storage and compute resources, requires detailed information about the jobs and the environment, and also assumes that the transfer overhead within an organization is zero.

Another class of Grid schedulers uses genetic algorithms to solve the data-intensive scheduling problem [17][18]. The main disadvantages of these approaches are the computational complexity of the genetic algorithm and the requirement to have detailed information about the environment.

Hadoop [5] is the most well-known open source implementation of Google's MapReduce paradigm. Hadoop's MapReduce framework is built on top of the Hadoop distributed file system (HDFS) containing all data to be mined. The map and reduce functions are typically written in Java, but executables can also be integrated via a streaming mechanism.
MapReduce frameworks like Hadoop do not offer the functionality to efficiently execute compute-intensive applications on a cluster, making them unsuitable for a general-purpose data mining system.

Hadoop On Demand in combination with the Sun Grid Engine tries to overcome these limitations by running Hadoop on top of a cluster management system, thus adding another layer of complexity. Still, the resources to be used for MapReduce are reserved exclusively for Hadoop and cannot be used by other compute-intensive jobs. Hadoop and similar MapReduce frameworks simplify the development and deployment of data-intensive applications on local clusters and cloud resources, but are currently not suited for large-scale, heterogeneous environments comprised of multiple independent organizations.

Anteater [1] is a web-service-based system to handle large data sets and high computational loads. Anteater applications have to be implemented in a filter-stream structure. This processing concept and its capability to distribute fine-grained parallel tasks make it a highly scalable system. Due to the restriction to a filter-stream structure, Anteater shares some downsides of MapReduce frameworks: applications have to be ported to the platform, which makes it almost impossible to integrate existing applications.

VI. CONCLUSION

In this article we introduced a multi-objective scheduling algorithm for data-intensive applications in Grid environments. The new concept of combined Grid resources in combination with the developed data-location-aware scheduling algorithm provides an infrastructure to build scalable data-intensive applications in worldwide, heterogeneous environments. The scheduling algorithm also supports compute-intensive applications, so that a single environment can be used for both data- and compute-intensive applications. In addition, the DOHS algorithm is specifically designed for Grid environments with Cloud resources, where information is generally scarce. The simulation results show that the algorithm is competitive with, or even surpasses, current Grid schedulers that require detailed information. Future work may focus on additional Cloud-related topics such as the setup time of a resource or dynamic cost.

REFERENCES

[1] D. Guedes, W. Meira, and R. Ferreira, "Anteater: A service-oriented architecture for high-performance data mining," IEEE Internet Computing, vol. 10, no. 4.
[2] S. Datta, K. Bhaduri, C. Giannella, and H. Kargupta, "Distributed data mining in peer-to-peer networks," IEEE Internet Computing, vol. 10, no. 4.
[3] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," 2004.
[4] S. Ghemawat, H. Gobioff, and S. T. Leung, "The Google file system," SIGOPS Oper. Syst. Rev., vol. 37, no. 5.
[5] T. White, Hadoop: The Definitive Guide, 1st ed., O'Reilly Media.
[6] V. Stankovski, M. Swain, V. Kravtsov, T. Niessen, D. Wegener, M. Röhm, J. Trnkoczy, M. May, J. Franke, A. Schuster, and W. Dubitzky, "Digging deep into the data mine with DataMiningGrid," IEEE Internet Computing, vol. 12, no. 6.
[7] A. Congiusta, D. Talia, and P. Trunfio, "Distributed data mining services leveraging WSRF," Future Generation Computer Systems, vol. 23, no. 1.
[8] B. Peter and W. Alexander, "Grid-aware approach to data statistics, data understanding and data preprocessing," International Journal of High Performance Computing and Networking, vol. 1, no. 6.
[9] R. McClatchey, A. Anjum, H. Stockinger, A. Ali, I. Willers, and M. Thomas, "Data intensive and network aware (DIANA) grid scheduling," Journal of Grid Computing, vol. 5.
[10] S. Venugopal and R. Buyya, "A set coverage-based mapping heuristic for scheduling distributed data-intensive applications on global grids," in Proceedings of the 7th IEEE/ACM International Conference on Grid Computing (Grid 2006), IEEE CS Press.
[11] K. Ranganathan and I. Foster, "Decoupling computation and data scheduling in distributed data-intensive applications," in HPDC '02: Proceedings of the 11th IEEE International Symposium on High Performance Distributed Computing, IEEE Computer Society, 2002.
[12] M. Röhm, M. Grabert, and F. Schweiggert, "A generalized MapReduce approach for efficient mining of large data sets in the Grid," in 1st International Conference on Cloud Computing, GRIDs, and Virtualization (CLOUD COMPUTING 2010), Lisbon, Portugal, 2010.
[13] S. G. Akl and F. Dong, "Scheduling algorithms for grid computing: State of the art and open problems," School of Computing, Queen's University, Kingston, Ontario, Canada, Tech. Rep.
[14] M. Röhm, M. Grabert, and F. Schweiggert, "An integrated approach for data- and compute-intensive mining of large data sets in the grid," International Journal On Advances in Intelligent Systems.
[15] S. Venugopal, R. Buyya, and L. Winton, "A grid service broker for scheduling e-science applications on global data grids," Concurrency and Computation: Practice and Experience, vol. 18.
[16] A. Rajasekar, M. Wan, and R. Moore, "MySRB & SRB: Components of a data grid," in HPDC, 2002.
[17] T. Phan, K. Ranganathan, and R. Sion, "Evolving toward the perfect schedule: Co-scheduling job assignments and data replication in wide-area systems using a genetic algorithm," in JSSPP, 2005.
[18] A. K. M. K. A. Talukder, M. Kirley, and R. Buyya, "Multiobjective differential evolution for scheduling workflow applications on global grids," Concurr. Comput.: Pract. Exper., vol. 21, no. 13.


Hadoop Virtualization Extensions on VMware vsphere 5 T E C H N I C A L W H I T E P A P E R Hadoop Virtualization Extensions on VMware vsphere 5 T E C H N I C A L W H I T E P A P E R Table of Contents Introduction... 3 Topology Awareness in Hadoop... 3 Virtual Hadoop... 4 HVE Solution... 5 Architecture...

More information

Remote Direct Storage Management for Exa-Scale Storage

Remote Direct Storage Management for Exa-Scale Storage , pp.15-20 http://dx.doi.org/10.14257/astl.2016.139.04 Remote Direct Storage Management for Exa-Scale Storage Dong-Oh Kim, Myung-Hoon Cha, Hong-Yeon Kim Storage System Research Team, High Performance Computing

More information

Harp-DAAL for High Performance Big Data Computing

Harp-DAAL for High Performance Big Data Computing Harp-DAAL for High Performance Big Data Computing Large-scale data analytics is revolutionizing many business and scientific domains. Easy-touse scalable parallel techniques are necessary to process big

More information

Introduction to MapReduce

Introduction to MapReduce Basics of Cloud Computing Lecture 4 Introduction to MapReduce Satish Srirama Some material adapted from slides by Jimmy Lin, Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed

More information

Grid Computing Systems: A Survey and Taxonomy

Grid Computing Systems: A Survey and Taxonomy Grid Computing Systems: A Survey and Taxonomy Material for this lecture from: A Survey and Taxonomy of Resource Management Systems for Grid Computing Systems, K. Krauter, R. Buyya, M. Maheswaran, CS Technical

More information

PROFILING BASED REDUCE MEMORY PROVISIONING FOR IMPROVING THE PERFORMANCE IN HADOOP

PROFILING BASED REDUCE MEMORY PROVISIONING FOR IMPROVING THE PERFORMANCE IN HADOOP ISSN: 0976-2876 (Print) ISSN: 2250-0138 (Online) PROFILING BASED REDUCE MEMORY PROVISIONING FOR IMPROVING THE PERFORMANCE IN HADOOP T. S. NISHA a1 AND K. SATYANARAYAN REDDY b a Department of CSE, Cambridge

More information

Multi-Criteria Strategy for Job Scheduling and Resource Load Balancing in Cloud Computing Environment

Multi-Criteria Strategy for Job Scheduling and Resource Load Balancing in Cloud Computing Environment Indian Journal of Science and Technology, Vol 8(30), DOI: 0.7485/ijst/205/v8i30/85923, November 205 ISSN (Print) : 0974-6846 ISSN (Online) : 0974-5645 Multi-Criteria Strategy for Job Scheduling and Resource

More information

A Level-wise Priority Based Task Scheduling for Heterogeneous Systems

A Level-wise Priority Based Task Scheduling for Heterogeneous Systems International Journal of Information and Education Technology, Vol., No. 5, December A Level-wise Priority Based Task Scheduling for Heterogeneous Systems R. Eswari and S. Nickolas, Member IACSIT Abstract

More information

An improved MapReduce Design of Kmeans for clustering very large datasets

An improved MapReduce Design of Kmeans for clustering very large datasets An improved MapReduce Design of Kmeans for clustering very large datasets Amira Boukhdhir Laboratoire SOlE Higher Institute of management Tunis Tunis, Tunisia Boukhdhir _ amira@yahoo.fr Oussama Lachiheb

More information

Google File System. Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung Google fall DIP Heerak lim, Donghun Koo

Google File System. Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung Google fall DIP Heerak lim, Donghun Koo Google File System Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung Google 2017 fall DIP Heerak lim, Donghun Koo 1 Agenda Introduction Design overview Systems interactions Master operation Fault tolerance

More information

Mapping a group of jobs in the error recovery of the Grid-based workflow within SLA context

Mapping a group of jobs in the error recovery of the Grid-based workflow within SLA context Mapping a group of jobs in the error recovery of the Grid-based workflow within SLA context Dang Minh Quan International University in Germany School of Information Technology Bruchsal 76646, Germany quandm@upb.de

More information