An improved MapReduce Design of Kmeans for clustering very large datasets

An improved MapReduce Design of K-means for clustering very large datasets

Amira Boukhdhir, Laboratoire SOIE, Higher Institute of Management of Tunis, Tunis, Tunisia, boukhdhir_amira@yahoo.fr
Oussama Lachiheb, Laboratoire SOIE, Higher Institute of Management of Tunis, Tunis, Tunisia, Oussama.Lachiheb@gmail.com
Mohamed Salah Gouider, Laboratoire SOIE, Higher Institute of Management of Tunis, Tunis, Tunisia, Ms.gouider@yahoo.fr

Abstract—Clustering is an important data analysis technique used to group data into sets of similar items. It is applicable in many different fields, but it becomes very challenging with the sharp increase in the volume of data generated by modern applications. K-means is a simple and widely used algorithm for cluster analysis, but the traditional k-means is computationally expensive, sensitive to outliers and produces unstable results, hence its inefficiency on very large datasets. Solving these issues is the subject of many recent research works. MapReduce is a simplified programming model designed to process data-intensive applications in a parallel environment. In this paper, we propose an improved MapReduce-based design of k-means that adapts it to large-scale datasets by reducing its execution time. Moreover, we propose two further algorithms: the first removes outliers from the dataset, and the second automatically selects the initial centroids, thereby stabilizing the result. The implementation of the proposed algorithm on the Hadoop platform shows that it is much faster than three other existing algorithms in the literature.

Keywords—k-means; MapReduce; Hadoop; Clustering; Big Data

I. INTRODUCTION

In today's era, the proliferation of data arriving quickly and from many different sources creates challenges in its collection, processing, management and analysis.
Big data emerges from the necessity to tackle these challenges in order to seize the opportunities that data offer. Data mining is a set of techniques applied to discover interesting knowledge from huge datasets, but traditional data mining approaches cannot be used directly for big data because of its sheer complexity. Clustering is one of the most popular tools in data mining, used to analyze accurately the massive volumes of data generated by modern applications. It consists in partitioning a large set of data into clusters in such a way that patterns within a cluster are maximally similar and patterns in different clusters are maximally dissimilar. It becomes very challenging, however, with the continuous rise in data quantity. Several algorithms have been designed to perform clustering, each using a different principle; they are divided into hierarchical, partitioning, density-based, grid-based and model-based algorithms [1]. K-means is the most commonly known partitioning algorithm used in cluster analysis. Its simplicity and its performance have attracted considerable interest, especially in recent years. However, it has shortcomings when dealing with very large datasets, due to its high computational complexity, its sensitivity to outliers and the dependence of its results on randomly selected initial centroids. Many solutions have been proposed to improve the performance of k-means, but none provides a global solution: some of the proposed algorithms are fast but sacrifice cluster quality, while others generate clusters of good quality but are very expensive in terms of computational complexity.

Xiaoli Cui et al. proposed an improved k-means algorithm [2]. They used a sampling technique to reduce the I/O cost and the network cost of PK-means. Experimental results showed that the optimized algorithm is efficient and performs better than k-means. However, as the algorithm is applied only to representative points instead of the whole dataset, the accuracy is not significantly high. Duong Van Hieu and Phayung Meesad proposed in their paper [3] a new approach that reduces the execution time of k-means by cutting off a number of the last iterations. Experimental results showed that this method can cut up to 30% of the iterations, equivalent to 30% of the execution time, while maintaining up to 98% accuracy. However, choosing the initial centroids randomly makes the clustering unstable, and the result may also be affected by noise points, leading to inaccuracy. Li Ma et al. in their paper [4] proposed a solution for improving the clustering quality of the traditional k-means. They solved the instability of clustering by selecting systematically the value of k as well as the initial centroids, and addressed the sensitivity to outliers by reducing the number of noise points. This algorithm

produces good-quality clusters but takes more computation time. Thus, the efficiency and effectiveness of k-means remain to be improved further. On this basis, our paper presents an enhanced MapReduce design of k-means for clustering very large datasets. First, we propose a solution that reduces the execution time of k-means so that it can handle very large datasets. Second, we propose an algorithm that selects the initial centroids systematically in order to stabilize the result, and another algorithm that removes outliers from the dataset, since they degrade the quality of clustering. This paper is structured as follows. First, related work is presented. Second, the proposed approach is explained. Third, the experimental analysis is shown. Finally, the conclusion and future scope are covered.

II. BACKGROUND AND RELATED WORKS

A. MapReduce Overview

MapReduce [5] is a programming model developed by Google in 2004 which supports distributed parallel computing across hundreds or thousands of nodes. Its goal is to hide the messy details of parallelization, fault tolerance and data distribution. With the MapReduce paradigm, the programmer only has to define two functions, Map and Reduce; MapReduce is responsible for all the remaining work. The input data is divided into small blocks, and a map function is called for each block. The input and output data of both the map and reduce stages take the form of key-value pairs (k, v). After the map tasks complete, the MapReduce framework sorts their output by key; data with the same key is then fed into the same reduce task, which merges it to produce a single result. More details on MapReduce can be found in [6]. The major advantages of MapReduce are:
Ease of use: even for programmers who are not familiar with parallel programming.
Scalability: parallelizing the map and reduce operations makes it possible to increase processing power by merely adding servers to the cluster.
Failover properties: MapReduce can handle machine failures and intra-machine communications.

B. Hadoop Overview

Hadoop [7][8] is the most popular open-source framework implementing the MapReduce programming model. It is designed for writing and running distributed data-intensive applications, and it can handle huge volumes of data across clusters of commodity hardware. Hadoop consists mainly of two components:
The Hadoop Distributed File System (HDFS), an extended version of the Google File System (GFS) designed to store a massive amount of data reliably on a cluster of machines and to stream it at high bandwidth to user applications.
The MapReduce engine, composed of a JobTracker on the master node and a set of TaskTrackers on the slave nodes. The JobTracker determines the execution plan, assigns tasks to the slave nodes and monitors them. The TaskTrackers perform the map and reduce tasks to completion.
Hadoop gained its popularity thanks to several advantages:
Outstanding scalability: it can be scaled out from a single machine to thousands of machines without change.
High failure tolerance: Hadoop tolerates failures at different levels, including disks, nodes, switches and the network.
Cost effectiveness: Hadoop brings massively parallel computing to commodity hardware.
Flexibility: Hadoop is schema-less, i.e., it can handle all types of data.

C. K-means algorithm

The k-means algorithm [9] partitions the data into k clusters. Each cluster is represented by its centroid, the mean of the patterns belonging to it. The algorithm proceeds as follows.
Input: k, the number of clusters; the dataset.
Output: a set of k clusters.
1) Initially, choose k arbitrary patterns as centroids and assign each of the remaining patterns to the closest centroid.
2) For each cluster, calculate the centroid, then reassign each pattern in the dataset to the closest centroid.
The k-means algorithm begins with an initial set of k centers selected randomly and iteratively updates them so as to decrease the error function. It stops when a convergence condition is verified; common conditions are reaching a locally optimal partition (the partitioning error is not reduced by relocating the centers) or reaching a predefined number of iterations.
Advantages: k-means has the following benefits:
i) simple implementation with high efficiency and scalability;
ii) linear time complexity.
Disadvantages:
i) it is sensitive to outliers, which affect the accuracy of clustering;
ii) it is non-deterministic, producing different clusters on each execution, as the result heavily depends on the selection of the initial centroids;
iii) it is computationally expensive as the number of data points increases.
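The two steps and the convergence test above can be sketched in Python (a minimal single-machine sketch on one-dimensional points; the names `kmeans`, `points` and `centroids` are illustrative, not taken from any implementation discussed in this paper):

```python
import random

def kmeans(points, k, max_iters=100):
    """Plain k-means on 1-D points: random initial centroids,
    then alternate assignment and centroid update until stable."""
    centroids = random.sample(points, k)            # k arbitrary patterns
    clusters = [[] for _ in range(k)]
    for _ in range(max_iters):
        clusters = [[] for _ in range(k)]
        for p in points:                            # assign to closest centroid
            i = min(range(k), key=lambda c: abs(p - centroids[c]))
            clusters[i].append(p)
        new_centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:              # no relocation: converged
            break
        centroids = new_centroids
    return centroids, clusters
```

On well-separated data this loop converges in a handful of passes, but each pass touches every point, which is exactly the cost that becomes prohibitive once the dataset no longer fits on one machine.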

D. Parallel k-means based on MapReduce

PK-means (parallel k-means) is a MapReduce-based parallel k-means proposed in [10][11]. It consists of three functions: Map, Combine and Reduce. The Map function calculates the distance of each data point in the mapper to the k centers and assigns it to the closest one. The Combine function performs the local centroid calculations. The Reduce function calculates the new global centroids. The PK-means algorithm is as follows.
Input: k, the number of clusters; the dataset. Output: a set of k clusters.
Map. Input: a list of <key1, value1> pairs and the k global centroids, where key1 is the position of a data point and value1 is its content. Output: a list of <key2, value2> pairs, where key2 is the index of a cluster and value2 is the points belonging to that cluster.
1) Calculate the distances between the data point and the k centroids
2) Assign it to the closest centroid
3) Repeat 1) and 2) until all the data points in the mapper are processed
Combine. Input: a list of <key2, value2> pairs. Output: a list of <key3, value3> pairs, where key3 is the index of a cluster and value3 is the sum of the points belonging to that cluster together with their number.
1) Calculate the sum of the data points belonging to the same cluster
2) Calculate their number
3) Repeat 1) and 2) until the k clusters are processed
Reduce. Input: a list of <key3, value3> pairs. Output: a list of <key4, value4> pairs, where key4 is the index of a cluster and value4 is its new global centroid.
1) Calculate the new centroid of the cluster: the mean of the data points belonging to it
2) Repeat 1) until the centroids of the k clusters are calculated
This MapReduce job is repeated until convergence, i.e., until the centroids no longer change.
Advantages: PK-means has the following strengths. It is easy to implement; it is very efficient and takes less time to build clusters; and it performs well with respect to speed-up, scale-up and size-up.
Disadvantages: PK-means has two drawbacks.
It is sensitive to outliers and suffers from unstable results, like k-means. The I/O cost and the network cost are very expensive, as the whole dataset is shuffled over the cluster of nodes each time.

III. IM-KMEANS FOR CLUSTERING VERY LARGE DATASETS

Our purpose in this paper is to solve the problems mentioned above. To do this, we propose an algorithm to remove noise points from the dataset and another algorithm to select the initial centroids systematically. Moreover, we give a solution that reduces the number of data points to be clustered at each iteration, thereby improving the clustering speed and complexity.

A. Design of the proposed algorithm

The proposed algorithm consists of three stages.
Stage 1: the outliers are removed from the dataset. We define a radius eps and a number of neighbors nb to measure the density of the area to which an object belongs. For each data point in the dataset, we calculate its number of neighbors (the points whose distance to it is less than eps). If this number is less than nb, the point is considered an outlier and removed from the dataset.
Stage 2: the initial centroids are selected systematically. To perform this task, we define a MapReduce job consisting of three functions: Map, Combine and Reduce. The map function calculates the distances between the data points stored in the mapper and the centroids already selected. The combine function determines the locally farthest data point from them. The reduce function determines the data point in the dataset that is most distant from the centroids already selected. Before executing this MapReduce job, the master node determines the first two centroids, which are the two farthest objects in the input dataset. The job is repeated k-2 times to select the k initial centroids.
Stage 3: the dataset is partitioned into k clusters. A MapReduce job is defined to cluster the input dataset. As we aim to improve the clustering speed and reduce the complexity, we use a structure that limits the number of data points to be clustered at each iteration: a table containing the misplaced data points. Clustering is applied only to the data points present in this table. The clustering job is composed of three functions: Map, Combine and Reduce. The map function calculates the distances between the data points in the table of misplaced points and the k global centroids, then assigns each data point to the closest centroid. A data point whose distance to the closest centroid is less than d (a

predetermined value) is considered as placed in the right cluster and is removed from the table of misplaced points. The combine function calculates the local centroids. Finally, the reduce function computes the new global centroids. The three stages of the proposed algorithm are shown in the following figure.

Fig. 1. Execution of the proposed algorithm

B. Proposed algorithm

To describe the improved algorithm, we use the following notation:
D: the original dataset to be clustered
C: the set that will contain the selected initial centroids
k: the number of clusters, predetermined by the user

Stage 1: Remove noise points from the dataset D.
Algorithm 1
Input: Dataset D, eps, nb. Output: Dataset D without outliers.
1) nbneighbors = 0
2) Calculate the distance between a data point x and another point y. If the distance between x and y is less than eps, then nbneighbors = nbneighbors + 1
3) Repeat 1) and 2) until (nbneighbors = nb) or all the distances between x and the points in D have been calculated
4) If (nbneighbors < nb) then remove x from the dataset D
5) Repeat 1) to 4) until all the data points in the dataset have been tested

Stage 2: Select the initial centroids.
Algorithm 2
Map. Input: a list of <key1, value1> pairs and C, where key1 is the position of a data point in D and value1 is its content. Output: a list of <key2, value2> pairs, where key2 is the farthest object from an object in C and value2 is the corresponding distance.
1) Calculate the distance between an object of C and the objects of D
2) Find the farthest object from that object
3) Repeat 1) and 2) until all the objects in C are processed
Combine. Input: a list of <key2, value2> pairs. Output: a pair <key3, value3>, where key3 is the locally farthest object from all the objects in C and value3 is the corresponding distance.
1) Find the farthest object from all the objects in C
Reduce. Input: <key3, value3>. Output: <key4, value4>, where value4 is the set of initial centroids.
1) Find the farthest object in D from the objects in C
2) Add that object to C

Stage 3: Cluster the dataset.
Algorithm 3
Map. Input: a list of <key1, value1> pairs, the k global centroids and d, where key1 is the position of a data point, value1 is its content and d is a predetermined value. Output: a list of <key5, value5> pairs, where key5 is the index of a cluster and value5 is the points belonging to that cluster.
1) Initially, all the data points stored in the mapper are placed in a table containing the misplaced points
2) Calculate the distances between an element of the table and the k global centroids
3) If the distance between the element and the closest centroid is less than d, remove this element from the table of misplaced points and assign it to the corresponding cluster; else, assign it to the cluster with the closest centroid
4) Repeat 2) and 3) until all the data points in the table are processed
Combine. Input: a list of <key5, value5> pairs. Output: a list of <key6, value6> pairs, where key6 is the index of a cluster and value6 is the sum of the points belonging to that cluster together with their number.
1) Calculate the sum of the data points belonging to the same cluster
2) Calculate their number
3) Repeat 1) and 2) until the k clusters are processed
Reduce. Input: a list of <key6, value6> pairs. Output: a list of <key7, value7> pairs, where key7 is the index of a cluster and value7 is its new global centroid.
1) Calculate the new centroid of the cluster: the mean of the data points belonging to it
2) Repeat 1) until the centroids of the k clusters are calculated
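The three stages can be condensed into a single-machine Python sketch (an illustrative reading of the pseudocode above on one-dimensional points, with hypothetical names; the paper itself runs Stages 2 and 3 as distributed MapReduce jobs over Hadoop):

```python
def remove_outliers(points, eps, nb):
    """Stage 1: keep a point only if it has at least nb neighbors
    within radius eps; fewer neighbors marks it as an outlier."""
    kept = []
    for i, x in enumerate(points):
        neighbors = sum(1 for j, y in enumerate(points)
                        if j != i and abs(x - y) < eps)
        if neighbors >= nb:
            kept.append(x)
    return kept

def select_centroids(points, k):
    """Stage 2: start C with the two farthest objects in D, then add,
    k-2 times, the point of D most distant from the objects in C."""
    a, b = max(((x, y) for x in points for y in points),
               key=lambda pair: abs(pair[0] - pair[1]))
    C = [a, b]
    while len(C) < k:
        farthest = max(points, key=lambda x: min(abs(x - c) for c in C))
        C.append(farthest)
    return C

def im_cluster(points, centroids, d, max_iters=100):
    """Stage 3: only points still in the misplaced table are re-assigned;
    a point closer than d to its centroid leaves the table for good."""
    k = len(centroids)
    assign = {}                              # point index -> cluster index
    misplaced = list(range(len(points)))     # initially, every point
    for _ in range(max_iters):
        for i in list(misplaced):
            dists = [abs(points[i] - c) for c in centroids]
            j = min(range(k), key=lambda c: dists[c])
            assign[i] = j
            if dists[j] < d:                 # well placed: drop from table
                misplaced.remove(i)
        new = []
        for j in range(k):                   # recompute global centroids
            members = [points[i] for i, cl in assign.items() if cl == j]
            new.append(sum(members) / len(members) if members else centroids[j])
        if new == centroids:                 # convergence
            break
        centroids = new
    return centroids, assign
```

The misplaced table is what cuts the per-iteration cost: once a point sits within d of its centroid it is never re-examined, so later iterations touch far fewer points than PK-means, which re-shuffles the whole dataset in every round.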

Global IM K-means algorithm
Input: a dataset D; the number of clusters k; eps; nb; d. Output: k clusters.
1) At the master worker:
Remove the noise points
Select the first two centroids: the farthest objects in D
Form a list of <key1, value1> pairs from D, called List1
Divide List1 into W blocks and distribute them to the W workers
2) At all workers:
Run the map function to produce <key2, value2> pairs
Run the combine function to produce <key3, value3> pairs
3) At the master worker:
Run the reduce function to produce <key4, value4> pairs, which are the initial global centroids
4) At all workers:
Run the map function to produce <key5, value5> pairs: the index of a cluster and the data points belonging to it
Run the combine function to produce <key6, value6> pairs: the sum of the points belonging to the same cluster and their number
5) At the master node:
Run the reduce function to produce <key7, value7> pairs, which are the new global centroids
6) At the master node:
Calculate the difference between the new global centroids and the old ones. If there is no change, stop; otherwise, repeat from step 4.
The flowchart of the proposed algorithm is described in Fig. 2.

IV. IMPLEMENTATION AND EXPERIMENTAL RESULTS

A. Implementation of the proposed algorithm
Our algorithm was implemented in Java, using the Hadoop framework to run the MapReduce tasks. We used two machines, each with an Intel i3 processor and 4 GB of RAM. The dataset used in our experimental study is a real dataset collected from the Tunisian stock exchange daily trading in fiscal years 2012, 2013 and after. It contains 1,020,000 records, each with six attributes, and its size is 1.1 GB. The characteristics of the dataset are shown in Table 1.

TABLE 1. EXPERIMENTAL DATASET
Dataset    Number of records    Number of attributes    Size
Dataset1   1,020,000            6                       1.1 GB

B. Experimental results
First, we ran our algorithm to observe the number of iterations for different numbers of data points. Table 2 shows the experimental results obtained, and the following figure depicts them.

Fig. 3.
Number of iterations of IM k-means

The results show that when the number of data points increases, the number of iterations also increases. Second, we compared the proposed algorithm with three other algorithms, namely traditional k-means, PK-means and Fast k-means, in terms of execution time. The experiments were carried out six times with different numbers of records. Fig. 4 illustrates the results obtained.

Fig. 2. Flowchart of IM k-means

Fig. 4. Running time comparison

According to the figure, the execution time of IM k-means is clearly much lower than that of traditional k-means, PK-means and Fast k-means. It is therefore more efficient than these algorithms and better suited to handling large-scale datasets.

V. CONCLUSION AND FUTURE SCOPE

In this paper, an improved design of k-means for clustering very large datasets efficiently is proposed. The experimental results show that our algorithm takes less time than traditional k-means, PK-means and Fast k-means. It has two limitations, however: first, the value of k is required as input; second, it can be applied only to datasets whose attributes have numerical values. In the future, we can enhance our approach so that the number of clusters is determined automatically, and extend its application to datasets with categorical attributes.

References
[1] C. Fraley and A. E. Raftery. How many clusters? Which clustering method? Answers via model-based cluster analysis. Technical Report, Department of Statistics, University of Washington.
[2] C. Xiaoli et al. Optimized big data K-means clustering using MapReduce. Springer Science+Business Media, New York.
[3] V. Duon and M. Phayung. Fast K-means clustering for very large datasets based on MapReduce combined with a new cutting method (FMR.K-Means). Springer International Publishing, Switzerland, 2015.
[4] M. Li et al. An improved k-means algorithm based on MapReduce and Grid. International Journal of Grid Distribution Computing.
[5] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Proceedings of the 6th Symposium on Operating Systems Design and Implementation.
[6] T. White. Hadoop: The Definitive Guide. 1st ed. O'Reilly Media, Inc.
[7] Apache Hadoop.
[8] J. Venner. Pro Hadoop. Apress.
[9] J. MacQueen. Some methods for classification and analysis of multivariate observations. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability.
[10] L. Zhenhua et al. Parallel K-means clustering of remote sensing images based on MapReduce. Springer-Verlag, Berlin Heidelberg.
[11] Z. Weizhong et al. Parallel K-means clustering based on MapReduce. International Conference on Cloud Computing Technology and Science (CloudCom), 2009.


Implementation of Parallel CASINO Algorithm Based on MapReduce. Li Zhang a, Yijie Shi b International Conference on Artificial Intelligence and Engineering Applications (AIEA 2016) Implementation of Parallel CASINO Algorithm Based on MapReduce Li Zhang a, Yijie Shi b State key laboratory

More information

Enhancing K-means Clustering Algorithm with Improved Initial Center

Enhancing K-means Clustering Algorithm with Improved Initial Center Enhancing K-means Clustering Algorithm with Improved Initial Center Madhu Yedla #1, Srinivasa Rao Pathakota #2, T M Srinivasa #3 # Department of Computer Science and Engineering, National Institute of

More information

Unsupervised Learning Partitioning Methods

Unsupervised Learning Partitioning Methods Unsupervised Learning Partitioning Methods Road Map 1. Basic Concepts 2. K-Means 3. K-Medoids 4. CLARA & CLARANS Cluster Analysis Unsupervised learning (i.e., Class label is unknown) Group data to form

More information

Dynamic Clustering of Data with Modified K-Means Algorithm

Dynamic Clustering of Data with Modified K-Means Algorithm 2012 International Conference on Information and Computer Networks (ICICN 2012) IPCSIT vol. 27 (2012) (2012) IACSIT Press, Singapore Dynamic Clustering of Data with Modified K-Means Algorithm Ahamed Shafeeq

More information

A Survey on Parallel Rough Set Based Knowledge Acquisition Using MapReduce from Big Data

A Survey on Parallel Rough Set Based Knowledge Acquisition Using MapReduce from Big Data A Survey on Parallel Rough Set Based Knowledge Acquisition Using MapReduce from Big Data Sachin Jadhav, Shubhangi Suryawanshi Abstract Nowadays, the volume of data is growing at an nprecedented rate, big

More information

ABSTRACT I. INTRODUCTION

ABSTRACT I. INTRODUCTION International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2018 IJSRCSEIT Volume 3 Issue 3 ISS: 2456-3307 Hadoop Periodic Jobs Using Data Blocks to Achieve

More information

Clustering Lecture 8: MapReduce

Clustering Lecture 8: MapReduce Clustering Lecture 8: MapReduce Jing Gao SUNY Buffalo 1 Divide and Conquer Work Partition w 1 w 2 w 3 worker worker worker r 1 r 2 r 3 Result Combine 4 Distributed Grep Very big data Split data Split data

More information

Frequent Item Set using Apriori and Map Reduce algorithm: An Application in Inventory Management

Frequent Item Set using Apriori and Map Reduce algorithm: An Application in Inventory Management Frequent Item Set using Apriori and Map Reduce algorithm: An Application in Inventory Management Kranti Patil 1, Jayashree Fegade 2, Diksha Chiramade 3, Srujan Patil 4, Pradnya A. Vikhar 5 1,2,3,4,5 KCES

More information

APPLICATION OF MULTIPLE RANDOM CENTROID (MRC) BASED K-MEANS CLUSTERING ALGORITHM IN INSURANCE A REVIEW ARTICLE

APPLICATION OF MULTIPLE RANDOM CENTROID (MRC) BASED K-MEANS CLUSTERING ALGORITHM IN INSURANCE A REVIEW ARTICLE APPLICATION OF MULTIPLE RANDOM CENTROID (MRC) BASED K-MEANS CLUSTERING ALGORITHM IN INSURANCE A REVIEW ARTICLE Sundari NallamReddy, Samarandra Behera, Sanjeev Karadagi, Dr. Anantha Desik ABSTRACT: Tata

More information

732A54/TDDE31 Big Data Analytics

732A54/TDDE31 Big Data Analytics 732A54/TDDE31 Big Data Analytics Lecture 10: Machine Learning with MapReduce Jose M. Peña IDA, Linköping University, Sweden 1/27 Contents MapReduce Framework Machine Learning with MapReduce Neural Networks

More information

Cloud Computing CS

Cloud Computing CS Cloud Computing CS 15-319 Programming Models- Part III Lecture 6, Feb 1, 2012 Majd F. Sakr and Mohammad Hammoud 1 Today Last session Programming Models- Part II Today s session Programming Models Part

More information

Introduction to MapReduce

Introduction to MapReduce Basics of Cloud Computing Lecture 4 Introduction to MapReduce Satish Srirama Some material adapted from slides by Jimmy Lin, Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed

More information

K+ Means : An Enhancement Over K-Means Clustering Algorithm

K+ Means : An Enhancement Over K-Means Clustering Algorithm K+ Means : An Enhancement Over K-Means Clustering Algorithm Srikanta Kolay SMS India Pvt. Ltd., RDB Boulevard 5th Floor, Unit-D, Plot No.-K1, Block-EP&GP, Sector-V, Salt Lake, Kolkata-700091, India Email:

More information

International Journal of Advance Engineering and Research Development. A Study: Hadoop Framework

International Journal of Advance Engineering and Research Development. A Study: Hadoop Framework Scientific Journal of Impact Factor (SJIF): e-issn (O): 2348- International Journal of Advance Engineering and Research Development Volume 3, Issue 2, February -2016 A Study: Hadoop Framework Devateja

More information

CS 61C: Great Ideas in Computer Architecture. MapReduce

CS 61C: Great Ideas in Computer Architecture. MapReduce CS 61C: Great Ideas in Computer Architecture MapReduce Guest Lecturer: Justin Hsia 3/06/2013 Spring 2013 Lecture #18 1 Review of Last Lecture Performance latency and throughput Warehouse Scale Computing

More information

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce Parallel Programming Principle and Practice Lecture 10 Big Data Processing with MapReduce Outline MapReduce Programming Model MapReduce Examples Hadoop 2 Incredible Things That Happen Every Minute On The

More information

Fast and Effective System for Name Entity Recognition on Big Data

Fast and Effective System for Name Entity Recognition on Big Data International Journal of Computer Sciences and Engineering Open Access Research Paper Volume-3, Issue-2 E-ISSN: 2347-2693 Fast and Effective System for Name Entity Recognition on Big Data Jigyasa Nigam

More information

Clustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani

Clustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani Clustering CE-717: Machine Learning Sharif University of Technology Spring 2016 Soleymani Outline Clustering Definition Clustering main approaches Partitional (flat) Hierarchical Clustering validation

More information

FREQUENT PATTERN MINING IN BIG DATA USING MAVEN PLUGIN. School of Computing, SASTRA University, Thanjavur , India

FREQUENT PATTERN MINING IN BIG DATA USING MAVEN PLUGIN. School of Computing, SASTRA University, Thanjavur , India Volume 115 No. 7 2017, 105-110 ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version) url: http://www.ijpam.eu ijpam.eu FREQUENT PATTERN MINING IN BIG DATA USING MAVEN PLUGIN Balaji.N 1,

More information

K-Means Clustering With Initial Centroids Based On Difference Operator

K-Means Clustering With Initial Centroids Based On Difference Operator K-Means Clustering With Initial Centroids Based On Difference Operator Satish Chaurasiya 1, Dr.Ratish Agrawal 2 M.Tech Student, School of Information and Technology, R.G.P.V, Bhopal, India Assistant Professor,

More information

Incremental K-means Clustering Algorithms: A Review

Incremental K-means Clustering Algorithms: A Review Incremental K-means Clustering Algorithms: A Review Amit Yadav Department of Computer Science Engineering Prof. Gambhir Singh H.R.Institute of Engineering and Technology, Ghaziabad Abstract: Clustering

More information

EXTRACT DATA IN LARGE DATABASE WITH HADOOP

EXTRACT DATA IN LARGE DATABASE WITH HADOOP International Journal of Advances in Engineering & Scientific Research (IJAESR) ISSN: 2349 3607 (Online), ISSN: 2349 4824 (Print) Download Full paper from : http://www.arseam.com/content/volume-1-issue-7-nov-2014-0

More information

MI-PDB, MIE-PDB: Advanced Database Systems

MI-PDB, MIE-PDB: Advanced Database Systems MI-PDB, MIE-PDB: Advanced Database Systems http://www.ksi.mff.cuni.cz/~svoboda/courses/2015-2-mie-pdb/ Lecture 10: MapReduce, Hadoop 26. 4. 2016 Lecturer: Martin Svoboda svoboda@ksi.mff.cuni.cz Author:

More information

I ++ Mapreduce: Incremental Mapreduce for Mining the Big Data

I ++ Mapreduce: Incremental Mapreduce for Mining the Big Data IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 18, Issue 3, Ver. IV (May-Jun. 2016), PP 125-129 www.iosrjournals.org I ++ Mapreduce: Incremental Mapreduce for

More information

Volume 3, Issue 11, November 2015 International Journal of Advance Research in Computer Science and Management Studies

Volume 3, Issue 11, November 2015 International Journal of Advance Research in Computer Science and Management Studies Volume 3, Issue 11, November 2015 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online at: www.ijarcsms.com

More information

Storm Identification in the Rainfall Data Using Singular Value Decomposition and K- Nearest Neighbour Classification

Storm Identification in the Rainfall Data Using Singular Value Decomposition and K- Nearest Neighbour Classification Storm Identification in the Rainfall Data Using Singular Value Decomposition and K- Nearest Neighbour Classification Manoj Praphakar.T 1, Shabariram C.P 2 P.G. Student, Department of Computer Science Engineering,

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University 09/25/2017 Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10.

More information

Distributed Face Recognition Using Hadoop

Distributed Face Recognition Using Hadoop Distributed Face Recognition Using Hadoop A. Thorat, V. Malhotra, S. Narvekar and A. Joshi Dept. of Computer Engineering and IT College of Engineering, Pune {abhishekthorat02@gmail.com, vinayak.malhotra20@gmail.com,

More information

STATS Data Analysis using Python. Lecture 7: the MapReduce framework Some slides adapted from C. Budak and R. Burns

STATS Data Analysis using Python. Lecture 7: the MapReduce framework Some slides adapted from C. Budak and R. Burns STATS 700-002 Data Analysis using Python Lecture 7: the MapReduce framework Some slides adapted from C. Budak and R. Burns Unit 3: parallel processing and big data The next few lectures will focus on big

More information

SBKMMA: Sorting Based K Means and Median Based Clustering Algorithm Using Multi Machine Technique for Big Data

SBKMMA: Sorting Based K Means and Median Based Clustering Algorithm Using Multi Machine Technique for Big Data International Journal of Computer (IJC) ISSN 2307-4523 (Print & Online) Global Society of Scientific Research and Researchers http://ijcjournal.org/ SBKMMA: Sorting Based K Means and Median Based Algorithm

More information

A Comparative study of Clustering Algorithms using MapReduce in Hadoop

A Comparative study of Clustering Algorithms using MapReduce in Hadoop A Comparative study of Clustering Algorithms using MapReduce in Hadoop Dweepna Garg 1, Khushboo Trivedi 2, B.B.Panchal 3 1 Department of Computer Science and Engineering, Parul Institute of Engineering

More information

High Performance Computing on MapReduce Programming Framework

High Performance Computing on MapReduce Programming Framework International Journal of Private Cloud Computing Environment and Management Vol. 2, No. 1, (2015), pp. 27-32 http://dx.doi.org/10.21742/ijpccem.2015.2.1.04 High Performance Computing on MapReduce Programming

More information

Cloud Computing. Hwajung Lee. Key Reference: Prof. Jong-Moon Chung s Lecture Notes at Yonsei University

Cloud Computing. Hwajung Lee. Key Reference: Prof. Jong-Moon Chung s Lecture Notes at Yonsei University Cloud Computing Hwajung Lee Key Reference: Prof. Jong-Moon Chung s Lecture Notes at Yonsei University Cloud Computing Cloud Introduction Cloud Service Model Big Data Hadoop MapReduce HDFS (Hadoop Distributed

More information

The MapReduce Abstraction

The MapReduce Abstraction The MapReduce Abstraction Parallel Computing at Google Leverages multiple technologies to simplify large-scale parallel computations Proprietary computing clusters Map/Reduce software library Lots of other

More information

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP TITLE: Implement sort algorithm and run it using HADOOP PRE-REQUISITE Preliminary knowledge of clusters and overview of Hadoop and its basic functionality. THEORY 1. Introduction to Hadoop The Apache Hadoop

More information

2/26/2017. Originally developed at the University of California - Berkeley's AMPLab

2/26/2017. Originally developed at the University of California - Berkeley's AMPLab Apache is a fast and general engine for large-scale data processing aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes Low latency: sub-second

More information

Lecture 11 Hadoop & Spark

Lecture 11 Hadoop & Spark Lecture 11 Hadoop & Spark Dr. Wilson Rivera ICOM 6025: High Performance Computing Electrical and Computer Engineering Department University of Puerto Rico Outline Distributed File Systems Hadoop Ecosystem

More information

Enhanced Hadoop with Search and MapReduce Concurrency Optimization

Enhanced Hadoop with Search and MapReduce Concurrency Optimization Volume 114 No. 12 2017, 323-331 ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version) url: http://www.ijpam.eu ijpam.eu Enhanced Hadoop with Search and MapReduce Concurrency Optimization

More information

HADOOP FRAMEWORK FOR BIG DATA

HADOOP FRAMEWORK FOR BIG DATA HADOOP FRAMEWORK FOR BIG DATA Mr K. Srinivas Babu 1,Dr K. Rameshwaraiah 2 1 Research Scholar S V University, Tirupathi 2 Professor and Head NNRESGI, Hyderabad Abstract - Data has to be stored for further

More information

Datasets Size: Effect on Clustering Results

Datasets Size: Effect on Clustering Results 1 Datasets Size: Effect on Clustering Results Adeleke Ajiboye 1, Ruzaini Abdullah Arshah 2, Hongwu Qin 3 Faculty of Computer Systems and Software Engineering Universiti Malaysia Pahang 1 {ajibraheem@live.com}

More information

Unsupervised Data Mining: Clustering. Izabela Moise, Evangelos Pournaras, Dirk Helbing

Unsupervised Data Mining: Clustering. Izabela Moise, Evangelos Pournaras, Dirk Helbing Unsupervised Data Mining: Clustering Izabela Moise, Evangelos Pournaras, Dirk Helbing Izabela Moise, Evangelos Pournaras, Dirk Helbing 1 1. Supervised Data Mining Classification Regression Outlier detection

More information

Huge Data Analysis and Processing Platform based on Hadoop Yuanbin LI1, a, Rong CHEN2

Huge Data Analysis and Processing Platform based on Hadoop Yuanbin LI1, a, Rong CHEN2 2nd International Conference on Materials Science, Machinery and Energy Engineering (MSMEE 2017) Huge Data Analysis and Processing Platform based on Hadoop Yuanbin LI1, a, Rong CHEN2 1 Information Engineering

More information

INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET)

INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET) INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET) International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367(Print), ISSN 0976 6367(Print) ISSN 0976 6375(Online)

More information

Performing MapReduce on Data Centers with Hierarchical Structures

Performing MapReduce on Data Centers with Hierarchical Structures INT J COMPUT COMMUN, ISSN 1841-9836 Vol.7 (212), No. 3 (September), pp. 432-449 Performing MapReduce on Data Centers with Hierarchical Structures Z. Ding, D. Guo, X. Chen, X. Luo Zeliu Ding, Deke Guo,

More information

Document Clustering with Map Reduce using Hadoop Framework

Document Clustering with Map Reduce using Hadoop Framework Document Clustering with Map Reduce using Hadoop Framework Satish Muppidi* Department of IT, GMRIT, Rajam, AP, India msatishmtech@gmail.com M. Ramakrishna Murty Department of CSE GMRIT, Rajam, AP, India

More information

HDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung

HDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung HDFS: Hadoop Distributed File System CIS 612 Sunnie Chung What is Big Data?? Bulk Amount Unstructured Introduction Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per

More information

Figure 1 shows unstructured data when plotted on the co-ordinate axis

Figure 1 shows unstructured data when plotted on the co-ordinate axis 7th International Conference on Computational Intelligence, Communication Systems and Networks (CICSyN) Key Frame Extraction and Foreground Modelling Using K-Means Clustering Azra Nasreen Kaushik Roy Kunal

More information

Searching frequent itemsets by clustering data: towards a parallel approach using MapReduce

Searching frequent itemsets by clustering data: towards a parallel approach using MapReduce Searching frequent itemsets by clustering data: towards a parallel approach using MapReduce Maria Malek and Hubert Kadima EISTI-LARIS laboratory, Ave du Parc, 95011 Cergy-Pontoise, FRANCE {maria.malek,hubert.kadima}@eisti.fr

More information

Mining Distributed Frequent Itemset with Hadoop

Mining Distributed Frequent Itemset with Hadoop Mining Distributed Frequent Itemset with Hadoop Ms. Poonam Modgi, PG student, Parul Institute of Technology, GTU. Prof. Dinesh Vaghela, Parul Institute of Technology, GTU. Abstract: In the current scenario

More information

Clustering Documents. Document Retrieval. Case Study 2: Document Retrieval

Clustering Documents. Document Retrieval. Case Study 2: Document Retrieval Case Study 2: Document Retrieval Clustering Documents Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April, 2017 Sham Kakade 2017 1 Document Retrieval n Goal: Retrieve

More information

MapReduce: Simplified Data Processing on Large Clusters 유연일민철기

MapReduce: Simplified Data Processing on Large Clusters 유연일민철기 MapReduce: Simplified Data Processing on Large Clusters 유연일민철기 Introduction MapReduce is a programming model and an associated implementation for processing and generating large data set with parallel,

More information

CHAPTER 4: CLUSTER ANALYSIS

CHAPTER 4: CLUSTER ANALYSIS CHAPTER 4: CLUSTER ANALYSIS WHAT IS CLUSTER ANALYSIS? A cluster is a collection of data-objects similar to one another within the same group & dissimilar to the objects in other groups. Cluster analysis

More information

A Fast and High Throughput SQL Query System for Big Data

A Fast and High Throughput SQL Query System for Big Data A Fast and High Throughput SQL Query System for Big Data Feng Zhu, Jie Liu, and Lijie Xu Technology Center of Software Engineering, Institute of Software, Chinese Academy of Sciences, Beijing, China 100190

More information

Parallel Approach for Implementing Data Mining Algorithms

Parallel Approach for Implementing Data Mining Algorithms TITLE OF THE THESIS Parallel Approach for Implementing Data Mining Algorithms A RESEARCH PROPOSAL SUBMITTED TO THE SHRI RAMDEOBABA COLLEGE OF ENGINEERING AND MANAGEMENT, FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10. Cluster

More information

Analyzing Outlier Detection Techniques with Hybrid Method

Analyzing Outlier Detection Techniques with Hybrid Method Analyzing Outlier Detection Techniques with Hybrid Method Shruti Aggarwal Assistant Professor Department of Computer Science and Engineering Sri Guru Granth Sahib World University. (SGGSWU) Fatehgarh Sahib,

More information

SBKMA: Sorting based K-Means Clustering Algorithm using Multi Machine Technique for Big Data

SBKMA: Sorting based K-Means Clustering Algorithm using Multi Machine Technique for Big Data I J C T A, 8(5), 2015, pp. 2105-2110 International Science Press SBKMA: Sorting based K-Means Clustering Algorithm using Multi Machine Technique for Big Data E. Mahima Jane* and E. George Dharma Prakash

More information

EFFICIENT ALLOCATION OF DYNAMIC RESOURCES IN A CLOUD

EFFICIENT ALLOCATION OF DYNAMIC RESOURCES IN A CLOUD EFFICIENT ALLOCATION OF DYNAMIC RESOURCES IN A CLOUD S.THIRUNAVUKKARASU 1, DR.K.P.KALIYAMURTHIE 2 Assistant Professor, Dept of IT, Bharath University, Chennai-73 1 Professor& Head, Dept of IT, Bharath

More information

Clustering Documents. Case Study 2: Document Retrieval

Clustering Documents. Case Study 2: Document Retrieval Case Study 2: Document Retrieval Clustering Documents Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April 21 th, 2015 Sham Kakade 2016 1 Document Retrieval Goal: Retrieve

More information

Parallelizing Multiple Group by Query in Shared-nothing Environment: A MapReduce Study Case

Parallelizing Multiple Group by Query in Shared-nothing Environment: A MapReduce Study Case 1 / 39 Parallelizing Multiple Group by Query in Shared-nothing Environment: A MapReduce Study Case PAN Jie 1 Yann LE BIANNIC 2 Frédéric MAGOULES 1 1 Ecole Centrale Paris-Applied Mathematics and Systems

More information

Hadoop Map Reduce 10/17/2018 1

Hadoop Map Reduce 10/17/2018 1 Hadoop Map Reduce 10/17/2018 1 MapReduce 2-in-1 A programming paradigm A query execution engine A kind of functional programming We focus on the MapReduce execution engine of Hadoop through YARN 10/17/2018

More information

Batch Inherence of Map Reduce Framework

Batch Inherence of Map Reduce Framework Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 6, June 2015, pg.287

More information

ADVANCES in NATURAL and APPLIED SCIENCES

ADVANCES in NATURAL and APPLIED SCIENCES ADVANCES in NATURAL and APPLIED SCIENCES ISSN: 1995-0772 Published BYAENSI Publication EISSN: 1998-1090 http://www.aensiweb.com/anas 2017 April 11(4): pages 169-174 Open Access Journal Analysis of Accuracy

More information

Hadoop/MapReduce Computing Paradigm

Hadoop/MapReduce Computing Paradigm Hadoop/Reduce Computing Paradigm 1 Large-Scale Data Analytics Reduce computing paradigm (E.g., Hadoop) vs. Traditional database systems vs. Database Many enterprises are turning to Hadoop Especially applications

More information

Indexing in Search Engines based on Pipelining Architecture using Single Link HAC

Indexing in Search Engines based on Pipelining Architecture using Single Link HAC Indexing in Search Engines based on Pipelining Architecture using Single Link HAC Anuradha Tyagi S. V. Subharti University Haridwar Bypass Road NH-58, Meerut, India ABSTRACT Search on the web is a daily

More information

MATE-EC2: A Middleware for Processing Data with Amazon Web Services

MATE-EC2: A Middleware for Processing Data with Amazon Web Services MATE-EC2: A Middleware for Processing Data with Amazon Web Services Tekin Bicer David Chiu* and Gagan Agrawal Department of Compute Science and Engineering Ohio State University * School of Engineering

More information

A Review Approach for Big Data and Hadoop Technology

A Review Approach for Big Data and Hadoop Technology International Journal of Modern Trends in Engineering and Research www.ijmter.com e-issn No.:2349-9745, Date: 2-4 July, 2015 A Review Approach for Big Data and Hadoop Technology Prof. Ghanshyam Dhomse

More information

AN IMPROVED DENSITY BASED k-means ALGORITHM

AN IMPROVED DENSITY BASED k-means ALGORITHM AN IMPROVED DENSITY BASED k-means ALGORITHM Kabiru Dalhatu 1 and Alex Tze Hiang Sim 2 1 Department of Computer Science, Faculty of Computing and Mathematical Science, Kano University of Science and Technology

More information

The Establishment of Large Data Mining Platform Based on Cloud Computing. Wei CAI

The Establishment of Large Data Mining Platform Based on Cloud Computing. Wei CAI 2017 International Conference on Electronic, Control, Automation and Mechanical Engineering (ECAME 2017) ISBN: 978-1-60595-523-0 The Establishment of Large Data Mining Platform Based on Cloud Computing

More information

Obtaining Rough Set Approximation using MapReduce Technique in Data Mining

Obtaining Rough Set Approximation using MapReduce Technique in Data Mining Obtaining Rough Set Approximation using MapReduce Technique in Data Mining Varda Dhande 1, Dr. B. K. Sarkar 2 1 M.E II yr student, Dept of Computer Engg, P.V.P.I.T Collage of Engineering Pune, Maharashtra,

More information

Clustering websites using a MapReduce programming model

Clustering websites using a MapReduce programming model Clustering websites using a MapReduce programming model Shafi Ahmed, Peiyuan Pan, Shanyu Tang* London Metropolitan University, UK Abstract In this paper, we describe an effective method of using Self-Organizing

More information