Cooperation between Data Modeling and Simulation Modeling for Performance Analysis of Hadoop


Cooperation between Data Modeling and Simulation Modeling for Performance Analysis of Hadoop

Byeong Soo Kim and Tag Gon Kim
Department of Electrical Engineering
Korea Advanced Institute of Science and Technology
Daejeon, Republic of Korea
{kevinzzang,

Abstract - Performance analysis of a complex computing system requires a lot of time and effort. Many studies related to performance analysis have been conducted. Firstly, there are studies on workload modeling that model and analyze observed workloads using statistical summaries. There are also studies that analyze system behavior and procedures through simulation modeling. However, each approach alone has disadvantages when a complex system must be analyzed accurately. Therefore, this paper presents cooperation between data modeling and simulation modeling for performance analysis of such a system. The target system, Hadoop, is one of the representative big data platforms and demonstrates the complexity of the analysis. We first identify the characteristics of Hadoop and divide them into two parts accordingly. Then, we model the parts using each modeling approach and integrate them. This paper presents experiments that show the advantages of the cooperative modeling in accuracy, execution time, and modeling extensibility.

Keywords - cooperative modeling; performance analysis; data modeling; simulation modeling; Hadoop

I. INTRODUCTION

Performance analysis of a complex computing system requires a deep understanding of the system. The larger and more complex the system, the more cost and effort are needed for the performance analysis. Its results are very important because they help predict the future behavior of the computing system and maximize its performance. Furthermore, such results can be used in resource planning, parameter tuning, and so on [1]. Meanwhile, demand for big data computing platforms such as Hadoop is exploding in the big data era.
Hadoop is an open-source software framework used for distributed storage and processing of large data sets [2]. Hadoop consists of MapReduce and the Hadoop Distributed File System (HDFS). While it is efficient and reliable for data-intensive computing, there are many parameters to configure in a Hadoop cluster for efficient execution. Also, it is difficult to set up a physical Hadoop cluster to evaluate the scalability of an application up to one thousand nodes. Therefore, performance analysis of Hadoop is one of the most important issues for deciding on a set of optimal parameters for good performance. When modeling such a system for performance analysis, the model can be divided into a workload model and a system model, as shown in Figure 1. Generally, workload means the amount of work assigned to the system in a given time period [3]. Workload modeling creates a statistical summary of these workloads through observation of the system. It can be applied to all workload attributes, such as CPU usage, memory usage, I/O behavior, and network traffic. Workload modeling provides the ability to change model parameters and reduces the file size compared to normal workload traces. A system model is a conceptual model, the result of system modeling, that describes and represents the structure, process, and characteristics of the system [4]. Depending on the abstraction level of the system or the purpose of the analysis, we can perform the performance analysis of a Hadoop system through both models together or through each model alone. The performance analysis of the Hadoop framework has been addressed previously. Firstly, some research has been done on the performance analysis of Hadoop using workload models [5, 6]. This can be called data modeling using observed data (traces). These studies performed statistical analysis and modeling using real workload traces; they extracted features about jobs from the traces and generated realistic synthetic workloads used for prediction.
Such a model predicts workload patterns well, but it considers only map and reduce performance and does not reflect platform models that include cluster information or hardware. It can be seen as a high-abstraction-level model of MapReduce. On the other hand, there are existing Hadoop simulators built using system knowledge [7]. This approach can be called simulation modeling using prior knowledge. HSim [8] and MRPerf [9] are representative simulators. They can simulate the dynamic behaviors of Hadoop clusters, and they can configure many Hadoop parameters, including hardware and cluster parameters. However, they do not consider a workload model built using data modeling. For example, the characteristics of each Hadoop application and the disk I/O model, which are difficult to model with low-level knowledge, are only crudely reflected in these simulators. This can be detrimental to prediction accuracy and scalability in performance analysis. SummerSim-SPECTS, 2017 July 9-12, Bellevue, Washington, USA; 2017 Society for Modeling & Simulation International (SCS)

Fig. 1. Concept of the proposed cooperative model: a workload model built by data modeling from the real workload trace (observed data) and a system model built by simulation modeling from the system's structure and process (prior knowledge) cooperate complementarily within the Hadoop framework for performance analysis.

In order to overcome the disadvantages of each approach, we need a way to obtain enhanced results through cooperation between data modeling of the workload model using actual data and simulation modeling of the system model using knowledge of the system components. In this paper, we propose a cooperative modeling of the two modeling approaches for the performance analysis of Hadoop. We first perform conceptual modeling and partition the model according to the characteristics of the Hadoop components. Then, we model each partitioned part using the two modeling methods and integrate them. Because we maximize the benefits of each modeling method, it is possible to improve prediction accuracy and scalability. This paper is organized as follows: background knowledge about Hadoop and each modeling method is briefly introduced; then our proposed cooperation of the two modeling methods is described; finally, experiments are provided to show the contributions of the work.

II. PRELIMINARIES

A. Hadoop Overview
Hadoop is a representative big data platform for reliable, scalable, and large-scale distributed computing [4]. MapReduce is a computing framework for large-scale distributed data processing based on the divide-and-conquer paradigm. It works by breaking the processing into map and reduce [10]. The MapReduce framework executes the map and reduce tasks in parallel on different machines within the Hadoop cluster. Map performs data filtering and sorting, and reduce performs summary operations. Users can define the map and reduce functions as well as the types of their input and output.
Figure 2 shows the concept of MapReduce.

Fig. 2. Concept of MapReduce: input splits are processed by map tasks, whose outputs are sorted, shuffled, and merged before being consumed by reduce tasks that produce the results.

HDFS is the distributed file system of Hadoop that stores data reliably using commodity clusters [11]. Input data stored on the HDFS are split into fixed-size blocks, and each block is allocated to a map task. The map processes each key-value pair in the block and outputs the result as a list of key-value pairs. The output of the map is then partitioned by key, and the partitions are transferred to their respective reduce tasks. This process is called shuffle. The gathered records are merged and sorted at each node's reduce task. The user-specified reduce reads and processes the key-value pairs sequentially. Finally, the outputs of the reduce are written to the HDFS.

B. Data Modeling and Simulation Modeling
Data modeling is used to build models that complement theory-based simulation models. It determines the correlation between a system's inputs and outputs using a training data set that is representative of all the behaviors found in the system [12]. Once the model is learned, it can be tested on another data set to determine how well it generalizes to unseen data. Data modeling consists of acquisition, modeling, validation, and prediction processes. It has seen wide use in science, engineering, economics, industry, and other fields to predict the future behavior of a system [12]. On the other hand, simulation modeling is a knowledge-based approach generally used in the simulation field. To build a model, it uses theories, physical laws, operational laws, and so on. Because a theory states what causes what and why, simulation modeling can clearly represent the causality between a set of inputs and outputs of the system, in contrast to data modeling [13].

Fig. 3. Definition of data modeling and simulation modeling: a data model is trained on observed input/output pairs (X, Y) of the real system by minimizing the error between Y and the predicted Ypre, whereas a simulation model is built by abstracting the real system from knowledge and then produces a predicted output Ypre for an input X.

The two approaches each have pros and cons. First, the simulation model enables a higher level of analysis, such as prescriptive analysis through causal relationships, whereas the data model generally remains at predictive analysis through correlations between variables [13]. Also, the simulation model can represent the dynamic map of input and state to output, but the data model represents only a static map of input variables to output variables. For a valid prediction with data modeling, the system structure should remain unchanged before and after training. It is also difficult for a data model to reflect abnormal conditions or a system that does not yet exist. Simulation modeling, on the other hand, requires sufficient knowledge of the system for a valid prediction, and it can be difficult to predict accurately due to the various assumptions or constraints of that knowledge. In the next section, we present a cooperative modeling of Hadoop considering the features of the two modeling methods.
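As a concrete illustration of the map, shuffle, and reduce stages described above, the following is a minimal, single-process Python sketch of a WordCount job. It illustrates the programming model only; names such as map_fn and run_job are our own, and real Hadoop distributes the splits and the shuffle across cluster nodes.

```python
# Minimal in-memory sketch of the map -> shuffle -> reduce flow (WordCount).
from collections import defaultdict

def map_fn(_key, line):
    # Map: emit a (word, 1) pair for each word in the input record.
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    # Reduce: summary operation over all values gathered for one key.
    return word, sum(counts)

def run_job(splits):
    # Shuffle: partition map outputs by key so each reducer sees one key group.
    groups = defaultdict(list)
    for split in splits:                 # each split would go to one map task
        for line in split:
            for key, value in map_fn(None, line):
                groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in sorted(groups.items()))

splits = [["the quick brown fox"], ["the lazy dog", "the fox"]]
print(run_job(splits))
# {'brown': 1, 'dog': 1, 'fox': 2, 'lazy': 1, 'quick': 1, 'the': 3}
```

In Hadoop itself the grouping step is performed by the framework between the map and reduce phases; here it is collapsed into a single dictionary for clarity.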

III. PROPOSED COOPERATIVE HADOOP MODEL

The proposed work is divided into four parts. The first is conceptual modeling, which identifies the overall characteristics of the Hadoop system. The second and third parts give a detailed description of each modeling method: data modeling and simulation modeling. The last part presents the integration and implementation of the data model and the simulation model.

A. Conceptual Modeling: Overall Structure
To model the Hadoop system, one must build a conceptual model that expresses the structure, abstraction level, and system elements according to the objectives of the analysis. The conceptual model should be partitioned into two models (data model and simulation model) according to the objective of the analysis and the acquisition level of data/knowledge. For Hadoop, it can be divided into two types of models: a workload model and a system model. The workload model consists of an application model and a disk I/O model, and the system model consists of a MapReduce model, an HDFS model, and a platform model. The system model can be represented as the simulation model because sufficient knowledge about its components can be obtained. On the other hand, the workload model can be represented as the data model with the acquisition of environmental data. The classification of models may vary depending on the purpose of the analysis. Table 1 shows the model partitioning and the description of each model. These partitioned models can be modeled using each modeling approach as follows:

TABLE I. MODEL CLASSIFICATION OF HADOOP

Workload / Application:
- Hadoop application program (WordCount, TeraSort, TestDFSIO, etc.)
- Difficult to learn internal operations
- Requires many assumptions for modeling

Workload / Disk I/O:
- Storage model for file write, read, and shuffle
- Requires low-level knowledge of storage
- Possible to use existing simulators

System / MapReduce:
- MapReduce framework (map -> shuffle -> reduce)
- Enough knowledge for system modeling
- Needs to reflect elements after modeling (parameters, algorithms)

System / HDFS:
- Operation of name node & data nodes
- Structure of the distributed file system
- Data placement algorithms
- Enough knowledge for system modeling
- Needs to reflect elements after modeling

System / Platform:
- Structure of the platform
- Hardware model, including network
- Flexible coupling relationships among master node and slave nodes

B. Part of Data Modeling: Workload Model
The workload model of Hadoop consists of an application model and a disk I/O model. The application model describes the Hadoop application program, for example, WordCount, TeraSort, TestDFSIO, and so on. Modeling such an application requires understanding its internal operation mechanisms, including the hardware performance, but these are very complex and require low-level knowledge. Also, because simulation modeling of them demands many assumptions, it can cause a drop in accuracy. Therefore, it is more appropriate to use the data modeling method for the application model than the simulation modeling method. The disk I/O model is similar to the application model. Modeling disk I/O requires low-level knowledge of the storage system. It is possible to use existing simulators like DiskSim [14], but they can overload the simulation time or resources, which does not fit the purpose of the simulation. So the disk I/O model is also created through data modeling. In this paper, we use Artificial Neural Networks (ANNs) to perform the data modeling (Figure 4). The ANN is one of the representative data modeling approaches, inspired by the biological neural networks of the human brain.
It is composed of a large number of highly interconnected neurons. ANN models are made by training the network to represent the relationships and processes that are inherent within the data [15]. During training, the strengths of the neuron connections (called weights) are changed in order to calibrate the model.

Fig. 4. Artificial Neural Networks (ANNs): an input layer Ii is connected to a hidden layer Hj through weights Wij, and the hidden layer to an output layer Ok through weights Wjk.

Fig. 5. Process and result of data modeling using ANNs. Input parameter sets (configurations) such as the number of nodes, the size of input data, the number of files, and the chunk size, together with the corresponding Hadoop execution results, are used to train two workload models by minimizing the error between Y and Ypre: an application neural network whose outputs (for WordCount / TeraSort) are the size ratio, processing time, and variance of the map, sort, and reducer stages, and a disk I/O neural network whose outputs are the shuffle time per node (sec/MB), the write throughput (MB/sec), and the read throughput (MB/sec).

To perform data modeling using ANNs, we first identify the input and output parameters of the target models. Then, we collect and extract the environmental data from executions of

the Hadoop application to use them as training data. After that, each data model of the application and disk I/O is created through a training process using the acquired data set. Figure 5 presents the input/output parameters of each model. For the data modeling, we use the Levenberg-Marquardt optimization technique as the learning algorithm [16] and the mean squared error as the measurement of learning performance.

C. Part of Simulation Modeling: System Model
The MapReduce and HDFS models can be modeled using domain knowledge. MapReduce operations are performed through the map, shuffle, and reduce processes, which run independently in parallel. A detailed description of each process is given in Section 2. In MapReduce, the unit of work that a client wants to perform is called a job, and it consists of input data, a MapReduce program, and configuration information. To control the job execution process, there are two kinds of nodes: a job tracker and a number of task trackers. The job tracker schedules the tasks to be performed by the task trackers so that all jobs are performed in the system as a whole. Task trackers perform each task and send progress reports to the job tracker, which keeps the entire history of each job as one record. If a task fails, the job tracker reschedules it to another task tracker. HDFS operates in a master-slave fashion: it has a name node as the master node and data nodes as slave nodes. The name node manages the namespace of the file system. It maintains the file system tree and metadata for all the files and directories in that tree. This information is persistently stored in two files on the local disk, in the form of a namespace image and an edit log, and the name node knows which data nodes hold the blocks of a given file. A data node is responsible for the actual operations of the file system: storing and retrieving blocks when requested by a client or the name node, and periodically reporting its list of stored blocks to the name node.
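Returning briefly to the workload model of Section III.B, the ANN training step (minimizing the mean squared error between the observed output Y and the predicted Ypre) can be sketched as follows. This is a hedged illustration with synthetic data: the inputs and the linear target stand in for real Hadoop configurations and measurements, and plain gradient descent is substituted for the Levenberg-Marquardt algorithm the paper actually uses.

```python
# One-hidden-layer neural network trained to map (normalized) configuration
# parameters to an observed output, minimizing mean squared error.
# Synthetic data only; gradient descent replaces Levenberg-Marquardt here.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, (200, 2))      # stand-ins for (#nodes, data size)
y = 2.0 * X[:, :1] + X[:, 1:]            # stand-in for a measured job metric

W1 = rng.normal(0.0, 0.5, (2, 8)); b1 = np.zeros(8)   # input -> hidden weights
W2 = rng.normal(0.0, 0.5, (8, 1)); b2 = np.zeros(1)   # hidden -> output weights

lr = 0.2
for _ in range(4000):
    H = np.tanh(X @ W1 + b1)             # hidden-layer activations
    y_pre = H @ W2 + b2                  # predicted output Ypre
    err = y_pre - y
    mse = float((err ** 2).mean())       # learning-performance measure
    # Backpropagate the MSE gradient and adjust the connection weights.
    g_out = 2.0 * err / len(X)
    g_hid = (g_out @ W2.T) * (1.0 - H ** 2)
    W2 -= lr * (H.T @ g_out); b2 -= lr * g_out.sum(0)
    W1 -= lr * (X.T @ g_hid); b1 -= lr * g_hid.sum(0)

print(f"final training MSE: {mse:.4f}")  # should be close to zero
```

Once trained, such a network is queried with unseen configurations to predict the workload outputs listed in Figure 5.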
The platform model includes a hardware model, such as a network model and a topology model, indicating the connections within the cluster. When these processes and structures are modeled by data modeling, the details become highly abstracted, which makes behavior analysis and structural change difficult. It is also difficult to represent the heterogeneous computing environments of a Hadoop platform with numerous nodes. Therefore, simulation modeling through an understanding of the whole process is needed rather than simple data modeling through data acquisition.

In this paper, we use the Discrete Event System Specification (DEVS) formalism for the simulation modeling of these models [17]. It is a set-theoretic specification of discrete event systems that has been widely used for modeling many applications in science and engineering. The DEVS formalism is hierarchical, modular, and object-oriented, so it is suitable for modeling the system model of Hadoop. It largely consists of atomic DEVS models representing system behavior and coupled DEVS models representing the structure of the system. The structure of the Hadoop system model using the DEVS formalism is shown in Figure 6, and the DEVS models for the internal components are shown in Figure 7.

Fig. 6. Structure of the Hadoop system model: a master node (JobTracker, NameNode) and slave nodes (TaskTracker, DataNode, disk I/O model, local file system) connected with the client through NICs and a network model.

Fig. 7. System model: example of the Hadoop DEVS model, with a MasterNode coupled model containing NameNode and JobTracker coupled models and a message switcher atomic model; input data and system parameters (parameters, algorithms) are given as inputs.

D. Integration and Implementation
After building the data model and the simulation model, it is required to integrate them. The models are connected through predefined input/output relationships. They can be implemented separately in heterogeneous environments and then interoperated through a middleware, or they can be developed and integrated within a homogeneous environment. In this paper, we develop the Hadoop model in the same environment. The integrated model is illustrated in Figure 8, which shows the components and the connections among the models.

Fig. 8. Integrated Hadoop model: the data models (application and disk I/O neural networks) embedded in the simulation models of the master node, slave nodes (including shuffler and reducer), client, and network.
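To make the formalism concrete, here is a minimal atomic DEVS model in plain Python. It is a toy sketch, not the paper's simulator: the TaskProcessor component and its fixed service time are invented for illustration, and the event loop below stands in for a full DEVS coordinator.

```python
# A minimal atomic DEVS model: state set, external/internal transition
# functions, output function, and time-advance function.
INF = float("inf")

class TaskProcessor:
    """Hypothetical node model: IDLE until a task arrives, BUSY for service_time."""
    def __init__(self, service_time=2.0):
        self.service_time = service_time
        self.phase, self.task = "IDLE", None

    def time_advance(self):          # ta(s): how long to remain in the state
        return self.service_time if self.phase == "BUSY" else INF

    def ext_transition(self, task):  # delta_ext: react to an input event
        if self.phase == "IDLE":
            self.phase, self.task = "BUSY", task

    def output(self):                # lambda(s): emitted just before delta_int
        return ("done", self.task)

    def int_transition(self):        # delta_int: fires when ta(s) expires
        self.phase, self.task = "IDLE", None

def simulate(model, arrivals):
    # Toy coordinator; assumes each task arrives after the previous one finishes.
    t, log = 0.0, []
    for at, task in arrivals:        # arrivals: [(time, task), ...]
        t = at
        model.ext_transition(task)
        t += model.time_advance()    # advance to the internal event
        log.append((t, model.output()))
        model.int_transition()
    return log

print(simulate(TaskProcessor(), [(0.0, "map1"), (5.0, "map2")]))
# [(2.0, ('done', 'map1')), (7.0, ('done', 'map2'))]
```

Coupled DEVS models (e.g., the MasterNode model of Figure 7) would then wire such atomic components together through input/output ports.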

IV. EXPERIMENT

This section presents experiments on the Hadoop model built with the proposed cooperative data modeling and simulation modeling approach. To demonstrate the effectiveness of the proposed work, three experiments were designed, as shown in Table 2. The first shows the prediction accuracy of the proposed work: we predict job completion time and throughput according to the number of data nodes and the size of the input data. In the second experiment, we compare the real execution times of each simulation. The final experiment compares the model extensibility of the proposed method with those of the existing methods.

TABLE II. EXPERIMENTAL DESIGN

Design: Prediction Accuracy | Objective: accuracy of output using RMSE | Parameter X: # of data nodes, total data size | Parameter Y: job completion time
Design: Execution Time | Objective: simulation execution time (simulation speed) | Parameter X: # of data nodes | Parameter Y: simulation execution time
Design: Extension | Objective: extensibility and behavior analysis | Parameter X: data placement algorithm | Parameter Y: job completion time

A. Prediction Accuracy
The most important factor that determines the performance of the simulation is the prediction accuracy of the output. Prediction accuracy can be compared through the root mean squared error (RMSE), calculated from the difference between the real execution result and the simulation result. The smaller the value, the closer the predicted result is to the actual result. In this experiment, we compare the prediction accuracy of the proposed model with that of the control groups by simulating the job completion time according to the number of data nodes and the data size. The real execution of Hadoop was conducted on a homogeneous cluster of 16 nodes, which consisted of one master node and 15 data nodes. The parameters used in the experiment are shown in Table 4.

TABLE IV. PARAMETERS USED IN EXPERIMENT A

Application: WordCount
# of Map / Reduce: 30 / 1
Chunk size: 64 MB
Total data size: 0.5 ~ 16 GB
# of data nodes: 1 ~ 1024

For these experiments, models built using only simulation modeling and only data modeling are used as control groups. In the first model, created using only simulation modeling, the system model is built with the DEVS formalism in the same way as the proposed work, but the workload model is also created through simulation modeling: the application model is made into a simple simulation model through an abstraction process, and the disk model uses the existing DiskSim created by domain experts [14]. The model made using only data modeling, on the other hand, is simpler: it is modeled at once using the entire input and output data of Hadoop, without distinction between the workload model and the system model. A detailed description of each is shown in Table 3.

TABLE III. MODELS USED IN THE EXPERIMENT

Proposed Work: Workload - data modeling using ANNs | System - simulation modeling using DEVS
Only Simulation Modeling: Workload - abstracted application model, DiskSim I/O model | System - simulation modeling using DEVS
Only Data Modeling: the entire system modeled at once using ANNs

Fig. 9. Experimental Result: Prediction Accuracy (job completion time versus (a) the number of data nodes and (b) the data size, comparing the execution result with the Data+Sim, Data, and Simulation models).

TABLE V. COMPARISON OF PREDICTION ACCURACY (RMSE)

Data + Sim.: (a) # of data nodes: 24.3 (lowest) | (b) data size: 86.4 (lowest)
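The RMSE comparison above is simple to state in code. A short sketch with made-up numbers (not the paper's measurements):

```python
# RMSE between real execution results and simulation results: the smaller
# the value, the closer the prediction is to the actual result.
import math

def rmse(real, predicted):
    return math.sqrt(sum((r - p) ** 2 for r, p in zip(real, predicted)) / len(real))

real_times = [120.0, 95.0, 80.0]   # hypothetical measured completion times (s)
pred_times = [115.0, 100.0, 78.0]  # hypothetical simulated completion times (s)
print(round(rmse(real_times, pred_times), 2))  # 4.24
```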

Figure 9 and Table 5 show the simulation results comparing the prediction accuracy. These results show that the proposed model has a lower RMSE than the other two models; in other words, the cooperative model of data modeling and simulation modeling achieves enhanced prediction accuracy compared to the other models. This holds for both the number of data nodes and the data size. In the first experiment, we can see that the data model yields some negative values of job completion time. This is because it is difficult to express the boundary conditions accurately with only the input/output data used for data modeling. The simulation model shows relatively accurate predictions, but its accuracy varies with how much the application affects the overall process of the system: because its application model is an abstracted one, accuracy degrades the more impact the application has on the overall system. As a result, we can see that cooperation between data modeling and simulation modeling gives better prediction results than using either modeling method alone.

B. Simulation Execution Time
Simulation execution time is also a very important factor in performance evaluation. As the number of nodes to be simulated or the number of experimental designs increases, the simulation time increases rapidly, causing a loss of time resources. In this experiment, we compare the simulation time according to the number of nodes for the Hadoop models created by simulation modeling only, data modeling only, and the proposed approach. The parameters used in this experiment are shown in Table 6.

TABLE VI. PARAMETERS USED IN EXPERIMENT B

Application: WordCount
# of Map / Reduce: 30 / 1
Chunk size: 64 MB
Total data size: 2 GB
# of data nodes: 1 ~ 1024

Figure 10 shows the execution times of each simulator. In the real system, as the number of data nodes increases, the execution time decreases.
On the other hand, in the simulation, the run time increases with the number of data nodes when the data size is held constant, because the computing resources required for the simulation grow with the number of simulated nodes. The data model has the highest simulation speed because it does not consider the cluster topology internally; the number of nodes, the topology, and the specifications are simply abstracted numerically inside the model. The low speed of the simulation model is caused by DiskSim, which serves as its I/O model. The proposed model has an intermediate speed between the two: it is slower than the data model but more accurate.

C. Extensibility
Data modeling and simulation modeling generally differ in purpose and features, one difference being model extensibility. As discussed earlier, the simulation model can take algorithms and object models, as well as parameters, as inputs. This makes it easy to perform experiments involving changes to system algorithms or models. The data model, however, can reflect only parameter changes; it is difficult to use algorithms or object models as its inputs. To consider them in the data model, we would need to collect new data and perform the data modeling process again. Additionally, it is difficult to analyze system behavior, such as failure analysis and topology analysis, with the data model. The proposed model, in contrast, makes these analyses possible with high extensibility. Since it has the advantages of simulation modeling, it can use various types of inputs: in addition to numerical parameters, it is possible to simulate the Hadoop system while changing algorithms and object models. In this experiment, we perform simulations that change the data placement algorithm as an input, which shows that the proposed model is more extensible than the data model.
We use a Round-Robin algorithm and a capacity-based algorithm as the data placement algorithms. Figure 11 shows the experimental results.

Fig. 10. Experimental Result: Execution Time (simulation execution time in minutes versus the number of data nodes, for the execution result and the Data+Sim., Data, and Sim. models).

Fig. 11. Experimental Result: Extensibility
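The two placement policies can be sketched as follows. These are simplified illustrations of the idea (assigning blocks to data nodes), not HDFS's actual placement logic, and all node names and sizes are invented.

```python
# Two data-placement policies: round-robin cycles through the data nodes,
# while a capacity-based policy favors the node with the most free space.
def round_robin_place(blocks, nodes):
    placement = {n: [] for n in nodes}
    for i, block in enumerate(blocks):
        placement[nodes[i % len(nodes)]].append(block)
    return placement

def capacity_place(blocks, free_space):      # free_space: {node: MB free}
    placement = {n: [] for n in free_space}
    space = dict(free_space)
    for block in blocks:
        node = max(space, key=space.get)     # most remaining capacity first
        placement[node].append(block)
        space[node] -= block["size"]
    return placement

blocks = [{"id": i, "size": 64} for i in range(4)]   # 64 MB chunks
print(round_robin_place(blocks, ["dn1", "dn2"]))     # dn1: blocks 0, 2; dn2: blocks 1, 3
print(capacity_place(blocks, {"dn1": 256, "dn2": 128}))
```

Swapping one function for the other changes the simulated placement behavior; this kind of algorithm-level substitution is what the data-only model cannot express.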

V. CONCLUSION

This paper presents cooperation between data modeling and simulation modeling for the performance analysis of Hadoop. There is research related to data modeling that analyzes observed workloads using statistical summaries, and there are studies that simulate system behavior and procedures through simulation modeling. Because each approach has disadvantages, we complement their shortcomings through their cooperation. To do so, we first identify the characteristics of Hadoop and classify its components into two parts accordingly. Then, we model the parts using each modeling approach and integrate them. This paper presents three experiments to show the characteristics of the cooperative modeling: prediction accuracy, execution time, and modeling extensibility. From these experiments, we can see that the cooperation of the two modeling methods gives better prediction results than using either modeling method alone. We can also see that the proposed model has advantages in execution speed over simulation modeling and in model extensibility over data modeling. In future work, we will add other components that are not reflected in this paper, and we will research a methodology for developing cooperative models of various systems.

REFERENCES
[1] L. E. B. Villalpando, A. April, and A. Abran, "Performance analysis model for big data applications in cloud computing," Journal of Cloud Computing, vol. 3, no. 1, p. 19.
[2] Apache Hadoop, (last accessed: )
[3] D. G. Feitelson, "Workload modeling for performance evaluation," IFIP International Symposium on Computer Performance Modeling, Measurement and Evaluation, Springer Berlin Heidelberg.
[4] H. Gronniger and B. Rumpe, "Definition of the System Model," UML 2 Semantics and Applications, p. 61.
[5] H. Yang, Z. Luan, W. Li, and D. Qian, "MapReduce workload modeling with statistical approach," Journal of Grid Computing, vol. 10, no. 2.
[6] R. De and A. Thomas, A Workload Model for MapReduce, Diss., Delft University of Technology (TU Delft).
[7] X. Wu, Y. Liu, and I. Gorton, "Exploring performance models of Hadoop applications on cloud architecture," Proceedings of the 11th International ACM SIGSOFT Conference on Quality of Software Architectures.
[8] Y. Liu, M. Li, N. K. Alham, and S. Hammoud, "HSim: a MapReduce simulator in enabling cloud computing," Future Generation Computer Systems, vol. 29, no. 1.
[9] G. Wang, A. R. Butt, P. Pandey, and K. Gupta, "Using realistic simulation for performance analysis of MapReduce setups," Proceedings of the 1st ACM Workshop on Large-Scale System and Application Performance.
[10] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1.
[11] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The Hadoop Distributed File System," 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies, Incline Village, USA, May 3-7, 2010.
[12] R. J. Abrahart, L. M. See, and D. P. Solomatine, Practical Hydroinformatics: Computational Intelligence and Technological Developments in Water Applications, vol. 68, Springer Science & Business Media.
[13] B. S. Kim, B. G. Kang, S. H. Choi, and T. G. Kim, "Data modeling versus simulation modeling in the big data era: case study of a greenhouse control system," to appear in SIMULATION: Transactions of The Society for Modeling and Simulation International.
[14] J. S. Bucy, J. Schindler, S. W. Schlosser, G. R. Ganger, and Contributors, The DiskSim Simulation Environment Version 4.0 Reference Manual, Carnegie Mellon University.
[15] K. S. Narendra and K. Parthasarathy, "Identification and control of dynamical systems using neural networks," IEEE Transactions on Neural Networks, vol. 1, no. 1, pp. 4-27.
[16] D. W. Marquardt, "An algorithm for least-squares estimation of nonlinear parameters," Journal of the Society for Industrial and Applied Mathematics, vol. 11, no. 2.
[17] B. P. Zeigler, H. Praehofer, and T. G. Kim, Theory of Modeling and Simulation, 2nd ed., Academic Press, 2001.


More information

Indexing Strategies of MapReduce for Information Retrieval in Big Data

Indexing Strategies of MapReduce for Information Retrieval in Big Data International Journal of Advances in Computer Science and Technology (IJACST), Vol.5, No.3, Pages : 01-06 (2016) Indexing Strategies of MapReduce for Information Retrieval in Big Data Mazen Farid, Rohaya

More information

MI-PDB, MIE-PDB: Advanced Database Systems

MI-PDB, MIE-PDB: Advanced Database Systems MI-PDB, MIE-PDB: Advanced Database Systems http://www.ksi.mff.cuni.cz/~svoboda/courses/2015-2-mie-pdb/ Lecture 10: MapReduce, Hadoop 26. 4. 2016 Lecturer: Martin Svoboda svoboda@ksi.mff.cuni.cz Author:

More information

Application-Aware SDN Routing for Big-Data Processing

Application-Aware SDN Routing for Big-Data Processing Application-Aware SDN Routing for Big-Data Processing Evaluation by EstiNet OpenFlow Network Emulator Director/Prof. Shie-Yuan Wang Institute of Network Engineering National ChiaoTung University Taiwan

More information

Introduction to MapReduce

Introduction to MapReduce Basics of Cloud Computing Lecture 4 Introduction to MapReduce Satish Srirama Some material adapted from slides by Jimmy Lin, Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed

More information

HADOOP FRAMEWORK FOR BIG DATA

HADOOP FRAMEWORK FOR BIG DATA HADOOP FRAMEWORK FOR BIG DATA Mr K. Srinivas Babu 1,Dr K. Rameshwaraiah 2 1 Research Scholar S V University, Tirupathi 2 Professor and Head NNRESGI, Hyderabad Abstract - Data has to be stored for further

More information

QADR with Energy Consumption for DIA in Cloud

QADR with Energy Consumption for DIA in Cloud Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 4, April 2014,

More information

CLOUD-SCALE FILE SYSTEMS

CLOUD-SCALE FILE SYSTEMS Data Management in the Cloud CLOUD-SCALE FILE SYSTEMS 92 Google File System (GFS) Designing a file system for the Cloud design assumptions design choices Architecture GFS Master GFS Chunkservers GFS Clients

More information

Dynamic Data Placement Strategy in MapReduce-styled Data Processing Platform Hua-Ci WANG 1,a,*, Cai CHEN 2,b,*, Yi LIANG 3,c

Dynamic Data Placement Strategy in MapReduce-styled Data Processing Platform Hua-Ci WANG 1,a,*, Cai CHEN 2,b,*, Yi LIANG 3,c 2016 Joint International Conference on Service Science, Management and Engineering (SSME 2016) and International Conference on Information Science and Technology (IST 2016) ISBN: 978-1-60595-379-3 Dynamic

More information

MATE-EC2: A Middleware for Processing Data with Amazon Web Services

MATE-EC2: A Middleware for Processing Data with Amazon Web Services MATE-EC2: A Middleware for Processing Data with Amazon Web Services Tekin Bicer David Chiu* and Gagan Agrawal Department of Compute Science and Engineering Ohio State University * School of Engineering

More information

A Fast and High Throughput SQL Query System for Big Data

A Fast and High Throughput SQL Query System for Big Data A Fast and High Throughput SQL Query System for Big Data Feng Zhu, Jie Liu, and Lijie Xu Technology Center of Software Engineering, Institute of Software, Chinese Academy of Sciences, Beijing, China 100190

More information

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros Data Clustering on the Parallel Hadoop MapReduce Model Dimitrios Verraros Overview The purpose of this thesis is to implement and benchmark the performance of a parallel K- means clustering algorithm on

More information

Distributed Face Recognition Using Hadoop

Distributed Face Recognition Using Hadoop Distributed Face Recognition Using Hadoop A. Thorat, V. Malhotra, S. Narvekar and A. Joshi Dept. of Computer Engineering and IT College of Engineering, Pune {abhishekthorat02@gmail.com, vinayak.malhotra20@gmail.com,

More information

Research on Load Balancing in Task Allocation Process in Heterogeneous Hadoop Cluster

Research on Load Balancing in Task Allocation Process in Heterogeneous Hadoop Cluster 2017 2 nd International Conference on Artificial Intelligence and Engineering Applications (AIEA 2017) ISBN: 978-1-60595-485-1 Research on Load Balancing in Task Allocation Process in Heterogeneous Hadoop

More information

Dynamic processing slots scheduling for I/O intensive jobs of Hadoop MapReduce

Dynamic processing slots scheduling for I/O intensive jobs of Hadoop MapReduce Dynamic processing slots scheduling for I/O intensive jobs of Hadoop MapReduce Shiori KURAZUMI, Tomoaki TSUMURA, Shoichi SAITO and Hiroshi MATSUO Nagoya Institute of Technology Gokiso, Showa, Nagoya, Aichi,

More information

APPLICATION OF HADOOP MAPREDUCE TECHNIQUE TOVIRTUAL DATABASE SYSTEM DESIGN. Neha Tiwari Rahul Pandita Nisha Chhatwani Divyakalpa Patil Prof. N.B.

APPLICATION OF HADOOP MAPREDUCE TECHNIQUE TOVIRTUAL DATABASE SYSTEM DESIGN. Neha Tiwari Rahul Pandita Nisha Chhatwani Divyakalpa Patil Prof. N.B. APPLICATION OF HADOOP MAPREDUCE TECHNIQUE TOVIRTUAL DATABASE SYSTEM DESIGN. Neha Tiwari Rahul Pandita Nisha Chhatwani Divyakalpa Patil Prof. N.B.Kadu PREC, Loni, India. ABSTRACT- Today in the world of

More information

Data Analysis Using MapReduce in Hadoop Environment

Data Analysis Using MapReduce in Hadoop Environment Data Analysis Using MapReduce in Hadoop Environment Muhammad Khairul Rijal Muhammad*, Saiful Adli Ismail, Mohd Nazri Kama, Othman Mohd Yusop, Azri Azmi Advanced Informatics School (UTM AIS), Universiti

More information

MAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti

MAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti International Journal of Computer Engineering and Applications, ICCSTAR-2016, Special Issue, May.16 MAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti 1 Department

More information

Fast and Effective System for Name Entity Recognition on Big Data

Fast and Effective System for Name Entity Recognition on Big Data International Journal of Computer Sciences and Engineering Open Access Research Paper Volume-3, Issue-2 E-ISSN: 2347-2693 Fast and Effective System for Name Entity Recognition on Big Data Jigyasa Nigam

More information

Survey on MapReduce Scheduling Algorithms

Survey on MapReduce Scheduling Algorithms Survey on MapReduce Scheduling Algorithms Liya Thomas, Mtech Student, Department of CSE, SCTCE,TVM Syama R, Assistant Professor Department of CSE, SCTCE,TVM ABSTRACT MapReduce is a programming model used

More information

Research Article Mobile Storage and Search Engine of Information Oriented to Food Cloud

Research Article Mobile Storage and Search Engine of Information Oriented to Food Cloud Advance Journal of Food Science and Technology 5(10): 1331-1336, 2013 DOI:10.19026/ajfst.5.3106 ISSN: 2042-4868; e-issn: 2042-4876 2013 Maxwell Scientific Publication Corp. Submitted: May 29, 2013 Accepted:

More information

Huge Data Analysis and Processing Platform based on Hadoop Yuanbin LI1, a, Rong CHEN2

Huge Data Analysis and Processing Platform based on Hadoop Yuanbin LI1, a, Rong CHEN2 2nd International Conference on Materials Science, Machinery and Energy Engineering (MSMEE 2017) Huge Data Analysis and Processing Platform based on Hadoop Yuanbin LI1, a, Rong CHEN2 1 Information Engineering

More information

A priority based dynamic bandwidth scheduling in SDN networks 1

A priority based dynamic bandwidth scheduling in SDN networks 1 Acta Technica 62 No. 2A/2017, 445 454 c 2017 Institute of Thermomechanics CAS, v.v.i. A priority based dynamic bandwidth scheduling in SDN networks 1 Zun Wang 2 Abstract. In order to solve the problems

More information

The Google File System. Alexandru Costan

The Google File System. Alexandru Costan 1 The Google File System Alexandru Costan Actions on Big Data 2 Storage Analysis Acquisition Handling the data stream Data structured unstructured semi-structured Results Transactions Outline File systems

More information

International Journal of Advance Engineering and Research Development. A Study: Hadoop Framework

International Journal of Advance Engineering and Research Development. A Study: Hadoop Framework Scientific Journal of Impact Factor (SJIF): e-issn (O): 2348- International Journal of Advance Engineering and Research Development Volume 3, Issue 2, February -2016 A Study: Hadoop Framework Devateja

More information

A Robust Cloud-based Service Architecture for Multimedia Streaming Using Hadoop

A Robust Cloud-based Service Architecture for Multimedia Streaming Using Hadoop A Robust Cloud-based Service Architecture for Multimedia Streaming Using Hadoop Myoungjin Kim 1, Seungho Han 1, Jongjin Jung 3, Hanku Lee 1,2,*, Okkyung Choi 2 1 Department of Internet and Multimedia Engineering,

More information

Clustering Lecture 8: MapReduce

Clustering Lecture 8: MapReduce Clustering Lecture 8: MapReduce Jing Gao SUNY Buffalo 1 Divide and Conquer Work Partition w 1 w 2 w 3 worker worker worker r 1 r 2 r 3 Result Combine 4 Distributed Grep Very big data Split data Split data

More information

An Improved Performance Evaluation on Large-Scale Data using MapReduce Technique

An Improved Performance Evaluation on Large-Scale Data using MapReduce Technique Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IMPACT FACTOR: 6.017 IJCSMC,

More information

Hadoop Virtualization Extensions on VMware vsphere 5 T E C H N I C A L W H I T E P A P E R

Hadoop Virtualization Extensions on VMware vsphere 5 T E C H N I C A L W H I T E P A P E R Hadoop Virtualization Extensions on VMware vsphere 5 T E C H N I C A L W H I T E P A P E R Table of Contents Introduction... 3 Topology Awareness in Hadoop... 3 Virtual Hadoop... 4 HVE Solution... 5 Architecture...

More information

CLUSTERING BIG DATA USING NORMALIZATION BASED k-means ALGORITHM

CLUSTERING BIG DATA USING NORMALIZATION BASED k-means ALGORITHM Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IMPACT FACTOR: 5.258 IJCSMC,

More information

CCA-410. Cloudera. Cloudera Certified Administrator for Apache Hadoop (CCAH)

CCA-410. Cloudera. Cloudera Certified Administrator for Apache Hadoop (CCAH) Cloudera CCA-410 Cloudera Certified Administrator for Apache Hadoop (CCAH) Download Full Version : http://killexams.com/pass4sure/exam-detail/cca-410 Reference: CONFIGURATION PARAMETERS DFS.BLOCK.SIZE

More information

An improved MapReduce Design of Kmeans for clustering very large datasets

An improved MapReduce Design of Kmeans for clustering very large datasets An improved MapReduce Design of Kmeans for clustering very large datasets Amira Boukhdhir Laboratoire SOlE Higher Institute of management Tunis Tunis, Tunisia Boukhdhir _ amira@yahoo.fr Oussama Lachiheb

More information

Database Applications (15-415)

Database Applications (15-415) Database Applications (15-415) Hadoop Lecture 24, April 23, 2014 Mohammad Hammoud Today Last Session: NoSQL databases Today s Session: Hadoop = HDFS + MapReduce Announcements: Final Exam is on Sunday April

More information

A New Model of Search Engine based on Cloud Computing

A New Model of Search Engine based on Cloud Computing A New Model of Search Engine based on Cloud Computing DING Jian-li 1,2, YANG Bo 1 1. College of Computer Science and Technology, Civil Aviation University of China, Tianjin 300300, China 2. Tianjin Key

More information

ADAPTIVE HANDLING OF 3V S OF BIG DATA TO IMPROVE EFFICIENCY USING HETEROGENEOUS CLUSTERS

ADAPTIVE HANDLING OF 3V S OF BIG DATA TO IMPROVE EFFICIENCY USING HETEROGENEOUS CLUSTERS INTERNATIONAL JOURNAL OF RESEARCH IN COMPUTER APPLICATIONS AND ROBOTICS ISSN 2320-7345 ADAPTIVE HANDLING OF 3V S OF BIG DATA TO IMPROVE EFFICIENCY USING HETEROGENEOUS CLUSTERS Radhakrishnan R 1, Karthik

More information

Survey Paper on Traditional Hadoop and Pipelined Map Reduce

Survey Paper on Traditional Hadoop and Pipelined Map Reduce International Journal of Computational Engineering Research Vol, 03 Issue, 12 Survey Paper on Traditional Hadoop and Pipelined Map Reduce Dhole Poonam B 1, Gunjal Baisa L 2 1 M.E.ComputerAVCOE, Sangamner,

More information

Optimization Scheme for Storing and Accessing Huge Number of Small Files on HADOOP Distributed File System

Optimization Scheme for Storing and Accessing Huge Number of Small Files on HADOOP Distributed File System Optimization Scheme for Storing and Accessing Huge Number of Small Files on HADOOP Distributed File System L. Prasanna Kumar 1, 1 Assoc. Prof, Department of Computer Science & Engineering., Dadi Institute

More information

PROFILING BASED REDUCE MEMORY PROVISIONING FOR IMPROVING THE PERFORMANCE IN HADOOP

PROFILING BASED REDUCE MEMORY PROVISIONING FOR IMPROVING THE PERFORMANCE IN HADOOP ISSN: 0976-2876 (Print) ISSN: 2250-0138 (Online) PROFILING BASED REDUCE MEMORY PROVISIONING FOR IMPROVING THE PERFORMANCE IN HADOOP T. S. NISHA a1 AND K. SATYANARAYAN REDDY b a Department of CSE, Cambridge

More information

Implementation of Aggregation of Map and Reduce Function for Performance Improvisation

Implementation of Aggregation of Map and Reduce Function for Performance Improvisation 2016 IJSRSET Volume 2 Issue 5 Print ISSN: 2395-1990 Online ISSN : 2394-4099 Themed Section: Engineering and Technology Implementation of Aggregation of Map and Reduce Function for Performance Improvisation

More information

A SURVEY ON SCHEDULING IN HADOOP FOR BIGDATA PROCESSING

A SURVEY ON SCHEDULING IN HADOOP FOR BIGDATA PROCESSING Journal homepage: www.mjret.in ISSN:2348-6953 A SURVEY ON SCHEDULING IN HADOOP FOR BIGDATA PROCESSING Bhavsar Nikhil, Bhavsar Riddhikesh,Patil Balu,Tad Mukesh Department of Computer Engineering JSPM s

More information

A brief history on Hadoop

A brief history on Hadoop Hadoop Basics A brief history on Hadoop 2003 - Google launches project Nutch to handle billions of searches and indexing millions of web pages. Oct 2003 - Google releases papers with GFS (Google File System)

More information

Correlation based File Prefetching Approach for Hadoop

Correlation based File Prefetching Approach for Hadoop IEEE 2nd International Conference on Cloud Computing Technology and Science Correlation based File Prefetching Approach for Hadoop Bo Dong 1, Xiao Zhong 2, Qinghua Zheng 1, Lirong Jian 2, Jian Liu 1, Jie

More information

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP TITLE: Implement sort algorithm and run it using HADOOP PRE-REQUISITE Preliminary knowledge of clusters and overview of Hadoop and its basic functionality. THEORY 1. Introduction to Hadoop The Apache Hadoop

More information

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets 2011 Fourth International Symposium on Parallel Architectures, Algorithms and Programming PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets Tao Xiao Chunfeng Yuan Yihua Huang Department

More information

SMCCSE: PaaS Platform for processing large amounts of social media

SMCCSE: PaaS Platform for processing large amounts of social media KSII The first International Conference on Internet (ICONI) 2011, December 2011 1 Copyright c 2011 KSII SMCCSE: PaaS Platform for processing large amounts of social media Myoungjin Kim 1, Hanku Lee 2 and

More information

Decision analysis of the weather log by Hadoop

Decision analysis of the weather log by Hadoop Advances in Engineering Research (AER), volume 116 International Conference on Communication and Electronic Information Engineering (CEIE 2016) Decision analysis of the weather log by Hadoop Hao Wu Department

More information

Introduction to MapReduce

Introduction to MapReduce Basics of Cloud Computing Lecture 4 Introduction to MapReduce Satish Srirama Some material adapted from slides by Jimmy Lin, Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed

More information

18-hdfs-gfs.txt Thu Oct 27 10:05: Notes on Parallel File Systems: HDFS & GFS , Fall 2011 Carnegie Mellon University Randal E.

18-hdfs-gfs.txt Thu Oct 27 10:05: Notes on Parallel File Systems: HDFS & GFS , Fall 2011 Carnegie Mellon University Randal E. 18-hdfs-gfs.txt Thu Oct 27 10:05:07 2011 1 Notes on Parallel File Systems: HDFS & GFS 15-440, Fall 2011 Carnegie Mellon University Randal E. Bryant References: Ghemawat, Gobioff, Leung, "The Google File

More information

A Multilevel Secure MapReduce Framework for Cross-Domain Information Sharing in the Cloud

A Multilevel Secure MapReduce Framework for Cross-Domain Information Sharing in the Cloud Calhoun: The NPS Institutional Archive Faculty and Researcher Publications Faculty and Researcher Publications 2013-03 A Multilevel Secure MapReduce Framework for Cross-Domain Information Sharing in the

More information

The Establishment of Large Data Mining Platform Based on Cloud Computing. Wei CAI

The Establishment of Large Data Mining Platform Based on Cloud Computing. Wei CAI 2017 International Conference on Electronic, Control, Automation and Mechanical Engineering (ECAME 2017) ISBN: 978-1-60595-523-0 The Establishment of Large Data Mining Platform Based on Cloud Computing

More information

International Journal of Scientific & Engineering Research, Volume 7, Issue 2, February-2016 ISSN

International Journal of Scientific & Engineering Research, Volume 7, Issue 2, February-2016 ISSN 68 Improving Access Efficiency of Small Files in HDFS Monica B. Bisane, Student, Department of CSE, G.C.O.E, Amravati,India, monica9.bisane@gmail.com Asst.Prof. Pushpanjali M. Chouragade, Department of

More information

Integration of analytic model and simulation model for analysis on system survivability

Integration of analytic model and simulation model for analysis on system survivability 6 Integration of analytic model and simulation model for analysis on system survivability Jang Se Lee Department of Computer Engineering, Korea Maritime and Ocean University, Busan, Korea Summary The objective

More information

LITERATURE SURVEY (BIG DATA ANALYTICS)!

LITERATURE SURVEY (BIG DATA ANALYTICS)! LITERATURE SURVEY (BIG DATA ANALYTICS) Applications frequently require more resources than are available on an inexpensive machine. Many organizations find themselves with business processes that no longer

More information

Novel Scheduling Algorithms for Efficient Deployment of MapReduce Applications in Heterogeneous Computing Environments

Novel Scheduling Algorithms for Efficient Deployment of MapReduce Applications in Heterogeneous Computing Environments Novel Scheduling Algorithms for Efficient Deployment of MapReduce Applications in Heterogeneous Computing Environments Sun-Yuan Hsieh 1,2,3, Chi-Ting Chen 1, Chi-Hao Chen 1, Tzu-Hsiang Yen 1, Hung-Chang

More information

Nowadays data-intensive applications play a

Nowadays data-intensive applications play a Journal of Advances in Computer Engineering and Technology, 3(2) 2017 Data Replication-Based Scheduling in Cloud Computing Environment Bahareh Rahmati 1, Amir Masoud Rahmani 2 Received (2016-02-02) Accepted

More information

Apache Flink: Distributed Stream Data Processing

Apache Flink: Distributed Stream Data Processing Apache Flink: Distributed Stream Data Processing K.M.J. Jacobs CERN, Geneva, Switzerland 1 Introduction The amount of data is growing significantly over the past few years. Therefore, the need for distributed

More information

Big Data 7. Resource Management

Big Data 7. Resource Management Ghislain Fourny Big Data 7. Resource Management artjazz / 123RF Stock Photo Data Technology Stack User interfaces Querying Data stores Indexing Processing Validation Data models Syntax Encoding Storage

More information

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK DISTRIBUTED FRAMEWORK FOR DATA MINING AS A SERVICE ON PRIVATE CLOUD RUCHA V. JAMNEKAR

More information

Facilitating Consistency Check between Specification & Implementation with MapReduce Framework

Facilitating Consistency Check between Specification & Implementation with MapReduce Framework Facilitating Consistency Check between Specification & Implementation with MapReduce Framework Shigeru KUSAKABE, Yoichi OMORI, Keijiro ARAKI Kyushu University, Japan 2 Our expectation Light-weight formal

More information

Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Yahoo! Sunnyvale, California USA {Shv, Hairong, SRadia,

Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Yahoo! Sunnyvale, California USA {Shv, Hairong, SRadia, Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Yahoo! Sunnyvale, California USA {Shv, Hairong, SRadia, Chansler}@Yahoo-Inc.com Presenter: Alex Hu } Introduction } Architecture } File

More information

A MapReduce based Parallel K-Means Clustering for Large Scale CIM Data Verification

A MapReduce based Parallel K-Means Clustering for Large Scale CIM Data Verification A MapReduce based Parallel K-Means Clustering for Large Scale CIM Data Verification Chuang Deng, Yang Liu*, Lixiong Xu, Jie Yang, Junyong Liu School of Electrical Engineering and Information, Sichuan University,

More information

Improved MapReduce k-means Clustering Algorithm with Combiner

Improved MapReduce k-means Clustering Algorithm with Combiner 2014 UKSim-AMSS 16th International Conference on Computer Modelling and Simulation Improved MapReduce k-means Clustering Algorithm with Combiner Prajesh P Anchalia Department Of Computer Science and Engineering

More information

A Cloud Computing Implementation of XML Indexing Method Using Hadoop

A Cloud Computing Implementation of XML Indexing Method Using Hadoop A Cloud Computing Implementation of XML Indexing Method Using Hadoop Wen-Chiao Hsu 1, I-En Liao 2,**, and Hsiao-Chen Shih 3 1,2,3 Department of Computer Science and Engineering National Chung-Hsing University,

More information

A Study of Cloud Computing Scheduling Algorithm Based on Task Decomposition

A Study of Cloud Computing Scheduling Algorithm Based on Task Decomposition 2016 3 rd International Conference on Engineering Technology and Application (ICETA 2016) ISBN: 978-1-60595-383-0 A Study of Cloud Computing Scheduling Algorithm Based on Task Decomposition Feng Gao &

More information

ABSTRACT I. INTRODUCTION

ABSTRACT I. INTRODUCTION International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2018 IJSRCSEIT Volume 3 Issue 3 ISS: 2456-3307 Hadoop Periodic Jobs Using Data Blocks to Achieve

More information

Analytics in the cloud

Analytics in the cloud Analytics in the cloud Dow we really need to reinvent the storage stack? R. Ananthanarayanan, Karan Gupta, Prashant Pandey, Himabindu Pucha, Prasenjit Sarkar, Mansi Shah, Renu Tewari Image courtesy NASA

More information

A Comparative study of Clustering Algorithms using MapReduce in Hadoop

A Comparative study of Clustering Algorithms using MapReduce in Hadoop A Comparative study of Clustering Algorithms using MapReduce in Hadoop Dweepna Garg 1, Khushboo Trivedi 2, B.B.Panchal 3 1 Department of Computer Science and Engineering, Parul Institute of Engineering

More information

Department of Computer Science San Marcos, TX Report Number TXSTATE-CS-TR Clustering in the Cloud. Xuan Wang

Department of Computer Science San Marcos, TX Report Number TXSTATE-CS-TR Clustering in the Cloud. Xuan Wang Department of Computer Science San Marcos, TX 78666 Report Number TXSTATE-CS-TR-2010-24 Clustering in the Cloud Xuan Wang 2010-05-05 !"#$%&'()*+()+%,&+!"-#. + /+!"#$%&'()*+0"*-'(%,1$+0.23%(-)+%-+42.--3+52367&.#8&+9'21&:-';

More information

An Exploration of Designing a Hybrid Scale-Up/Out Hadoop Architecture Based on Performance Measurements

An Exploration of Designing a Hybrid Scale-Up/Out Hadoop Architecture Based on Performance Measurements This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI.9/TPDS.6.7, IEEE

More information

The MapReduce Framework

The MapReduce Framework The MapReduce Framework In Partial fulfilment of the requirements for course CMPT 816 Presented by: Ahmed Abdel Moamen Agents Lab Overview MapReduce was firstly introduced by Google on 2004. MapReduce

More information

Big Data for Engineers Spring Resource Management

Big Data for Engineers Spring Resource Management Ghislain Fourny Big Data for Engineers Spring 2018 7. Resource Management artjazz / 123RF Stock Photo Data Technology Stack User interfaces Querying Data stores Indexing Processing Validation Data models

More information

An Indian Journal FULL PAPER ABSTRACT KEYWORDS. Trade Science Inc. The study on magnanimous data-storage system based on cloud computing

An Indian Journal FULL PAPER ABSTRACT KEYWORDS. Trade Science Inc. The study on magnanimous data-storage system based on cloud computing [Type text] [Type text] [Type text] ISSN : 0974-7435 Volume 10 Issue 11 BioTechnology 2014 An Indian Journal FULL PAPER BTAIJ, 10(11), 2014 [5368-5376] The study on magnanimous data-storage system based

More information

HiTune. Dataflow-Based Performance Analysis for Big Data Cloud

HiTune. Dataflow-Based Performance Analysis for Big Data Cloud HiTune Dataflow-Based Performance Analysis for Big Data Cloud Jinquan (Jason) Dai, Jie Huang, Shengsheng Huang, Bo Huang, Yan Liu Intel Asia-Pacific Research and Development Ltd Shanghai, China, 200241

More information

The Analysis Research of Hierarchical Storage System Based on Hadoop Framework Yan LIU 1, a, Tianjian ZHENG 1, Mingjiang LI 1, Jinpeng YUAN 1

The Analysis Research of Hierarchical Storage System Based on Hadoop Framework Yan LIU 1, a, Tianjian ZHENG 1, Mingjiang LI 1, Jinpeng YUAN 1 International Conference on Intelligent Systems Research and Mechatronics Engineering (ISRME 2015) The Analysis Research of Hierarchical Storage System Based on Hadoop Framework Yan LIU 1, a, Tianjian

More information

Yuval Carmel Tel-Aviv University "Advanced Topics in Storage Systems" - Spring 2013

Yuval Carmel Tel-Aviv University Advanced Topics in Storage Systems - Spring 2013 Yuval Carmel Tel-Aviv University "Advanced Topics in About & Keywords Motivation & Purpose Assumptions Architecture overview & Comparison Measurements How does it fit in? The Future 2 About & Keywords

More information

CS6030 Cloud Computing. Acknowledgements. Today s Topics. Intro to Cloud Computing 10/20/15. Ajay Gupta, WMU-CS. WiSe Lab

CS6030 Cloud Computing. Acknowledgements. Today s Topics. Intro to Cloud Computing 10/20/15. Ajay Gupta, WMU-CS. WiSe Lab CS6030 Cloud Computing Ajay Gupta B239, CEAS Computer Science Department Western Michigan University ajay.gupta@wmich.edu 276-3104 1 Acknowledgements I have liberally borrowed these slides and material

More information

Collaboration System using Agent based on MRA in Cloud

Collaboration System using Agent based on MRA in Cloud Collaboration System using Agent based on MRA in Cloud Jong-Sub Lee*, Seok-Jae Moon** *Department of Information & Communication System, Semyeong University, Jecheon, Korea. ** Ingenium college of liberal

More information

MapReduce, Hadoop and Spark. Bompotas Agorakis
