Determination of Optimal Epsilon (Eps) Value on DBSCAN Algorithm to Clustering Data on Peatland Hotspots in Sumatra

Similar documents
Density-based clustering algorithms DBSCAN and SNN

Unsupervised learning on Color Images

Density Based Clustering using Modified PSO based Neighbor Selection

Clustering to Reduce Spatial Data Set Size

DS504/CS586: Big Data Analytics Big Data Clustering II

COMPARISON OF DENSITY-BASED CLUSTERING ALGORITHMS

DBSCAN. Presented by: Garrett Poppe

AutoEpsDBSCAN : DBSCAN with Eps Automatic for Large Dataset

Clustering Lecture 4: Density-based Methods

Clustering Algorithms for Data Stream

DS504/CS586: Big Data Analytics Big Data Clustering II

An Efficient Clustering Algorithm for Moving Object Trajectories

A New Approach to Determine Eps Parameter of DBSCAN Algorithm

C-NBC: Neighborhood-Based Clustering with Constraints

Density-Based Clustering. Izabela Moise, Evangelos Pournaras

AN IMPROVED DENSITY BASED k-means ALGORITHM

Knowledge Discovery in Databases

Analysis and Extensions of Popular Clustering Algorithms

Heterogeneous Density Based Spatial Clustering of Application with Noise

Clustering Documentation

CS570: Introduction to Data Mining

Scalable Varied Density Clustering Algorithm for Large Datasets

CSE 5243 INTRO. TO DATA MINING

Density-Based Clustering Based on Probability Distribution for Uncertain Data

K-DBSCAN: Identifying Spatial Clusters With Differing Density Levels

Notes. Reminder: HW2 Due Today by 11:59PM. Review session on Thursday. Midterm next Tuesday (10/10/2017)

Clustering Part 4 DBSCAN

K-Means Clustering With Initial Centroids Based On Difference Operator

An Enhanced Density Clustering Algorithm for Datasets with Complex Structures

Notes. Reminder: HW2 Due Today by 11:59PM. Review session on Thursday. Midterm next Tuesday (10/09/2018)

Analysis of labor employment assessment on production machine to minimize time production

Conveyor Performance based on Motor DC 12 Volt Eg-530ad-2f using K-Means Clustering

Data Clustering With Leaders and Subleaders Algorithm

DENSITY BASED AND PARTITION BASED CLUSTERING OF UNCERTAIN DATA BASED ON KL-DIVERGENCE SIMILARITY MEASURE

Faster Clustering with DBSCAN

Design optimization method for Francis turbine

International Journal of Advance Engineering and Research Development CLUSTERING ON UNCERTAIN DATA BASED PROBABILITY DISTRIBUTION SIMILARITY

International Journal of Computer Engineering and Applications, Volume VIII, Issue III, Part I, December 14

Data Mining 4. Cluster Analysis

ENHANCED DBSCAN ALGORITHM

Simulation of rotation and scaling algorithm for numerically modelled structures

Improving the accuracy of k-nearest neighbor using local mean based and distance weight

Analyzing Outlier Detection Techniques with Hybrid Method

International Journal Of Engineering And Computer Science ISSN: Volume 5 Issue 11 Nov. 2016, Page No.

Network Traffic Measurements and Analysis

University of Florida CISE department Gator Engineering. Clustering Part 4

Data Mining Algorithms

Contents. 2.5 DBSCAN Pair Auto-Correlation Saving your results References... 18

Review of Spatial Clustering Methods

OPTICS-OF: Identifying Local Outliers

DATA MINING I - CLUSTERING - EXERCISES

International Journal of Research in Advent Technology, Vol.7, No.3, March 2019 E-ISSN: Available online at

Correspondence Clustering: An Approach to Cluster Multiple Related Spatial Datasets

A ProM Operational Support Provider for Predictive Monitoring of Business Processes

Impulsion of Mining Paradigm with Density Based Clustering of Multi Dimensional Spatial Data

数据挖掘 Introduction to Data Mining

Text clustering based on a divide and merge strategy

CSE 347/447: DATA MINING

Data Stream Clustering Using Micro Clusters

Quality Metrics for Visual Analytics of High-Dimensional Data

Optimizing Libraries Content Findability Using Simple Object Access Protocol (SOAP) With Multi- Tier Architecture

2. (a) Briefly discuss the forms of Data preprocessing with neat diagram. (b) Explain about concept hierarchy generation for categorical data.

Clustering part II 1

Performance analysis of parallel de novo genome assembly in shared memory system

Unsupervised Distributed Clustering

Dimensionality Reduction of Laplacian Embedding for 3D Mesh Reconstruction

PATENT DATA CLUSTERING: A MEASURING UNIT FOR INNOVATORS

Optimizing the hotspot performances by using load and resource balancing method

An Effective Outlier Detection-Based Data Aggregation for Wireless Sensor Networks

Cluster Analysis: Basic Concepts and Algorithms

A Parallel Community Detection Algorithm for Big Social Networks

Fast Learning for Big Data Using Dynamic Function

A fuzzy mathematical model of West Java population with logistic growth model

WEB USAGE MINING: ANALYSIS DENSITY-BASED SPATIAL CLUSTERING OF APPLICATIONS WITH NOISE ALGORITHM

Feature weighting using particle swarm optimization for learning vector quantization classifier

Unsupervised Learning : Clustering

ADCN: An Anisotropic Density-Based Clustering Algorithm for Discovering Spatial Point Patterns with Noise

CHAPTER 7. PAPER 3: EFFICIENT HIERARCHICAL CLUSTERING OF LARGE DATA SETS USING P-TREES

Count based K-Means Clustering Algorithm

Clustering methods: Part 7 Outlier removal Pasi Fränti

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data

International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.7, No.3, May Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani

DeLiClu: Boosting Robustness, Completeness, Usability, and Efficiency of Hierarchical Clustering by a Closest Pair Ranking

Internal Cluster Validation on Earthquake Data in the Province of Bengkulu

The k-means Algorithm and Genetic Algorithm


Chapter 8: GPS Clustering and Analytics

Large-Scale Flight Phase identification from ADS-B Data Using Machine Learning Methods

Clustering on Uncertain Data using Kullback Leibler Divergence Measurement based on Probability Distribution

Application Marketing Strategy Search Engine Optimization (SEO)

Framework of Frequently Trajectory Extraction from AIS Data

Research on Parallelized Stream Data Micro Clustering Algorithm Ke Ma 1, Lingjuan Li 1, Yimu Ji 1, Shengmei Luo 1, Tao Wen 2

CMS High Level Trigger Timing Measurements

Scalable Density-Based Distributed Clustering

Normalization based K means Clustering Algorithm

COMP 465: Data Mining Still More on Clustering

Surveillance of muddy river flow by telemetry system

ISSN: (Online) Volume 2, Issue 2, February 2014 International Journal of Advance Research in Computer Science and Management Studies

Clustering in Ratemaking: Applications in Territories Clustering

Data Reduction and Partitioning in an Extreme Scale GPU-Based Clustering Algorithm

Transcription:

IOP Conference Series: Earth and Environmental Science PAPER OPEN ACCESS Determination of Optimal Epsilon (Eps) Value on DBSCAN Algorithm to Clustering Data on Peatland Hotspots in Sumatra Related content - EPS conference comes to London - EPS rewards quasiparticle research - EPS takes stock To cite this article: Nadia Rahmah and Imas Sukaesih Sitanggang 2016 IOP Conf. Ser.: Earth Environ. Sci. 31 012012 View the article online for updates and enhancements. This content was downloaded from IP address 46.3.202.221 on 09/02/2018 at 07:54

Determination of Optimal Epsilon (Eps) Value on DBSCAN Algorithm to Clustering Data on Peatland Hotspots in Sumatra Nadia Rahmah 1 *, Imas Sukaesih Sitanggang 1 1 Computer Science Departement, FMIPA, Bogor Agricultural University, Bogor 16680 E-mail: nadiarahmahh@gmail.com Abstract. In this work we determine the optimal epsilon value on peatland on DBSCAN Algorithm to clustering data on peatland hotspots in sumatera. DBSCAN is a base algorithm for density based data clustering which contain noise and outliers. We found using this method that the area which has the highest density of hotspots in Sumatra in 2013 peatland is contained in cluster 1 of Riau Province that is equal to 2112 hotspots. 1. Introduction Monitoring of hotspots area could be performed by knowing if the hotspots are randomly distributed or clustered. DBSCAN is a base algorithm for density based clustering containing large amount of data which has noise and outliers. DBSCAN has 2 parameters namely Eps and MinPts. However, conventional DBSCAN cannot produce optimal Eps value. DBSCAN modifications is required to determine the optimal Eps value automatically. Determination of the optimal Eps value is aquired using DMDBSCAN algorithm on a single density level corresponding to the k-dist plot. DMDBSCAN is one method of DBSCAN algorithm modification on Eps value to obtain optimal value automatically with different densities. The parameter of values are used in the clustering data on hotspots peatlands in Sumatra. The data set of peatlands hotspots in Sumatera in the period 2013. This research modifies DBSCAN algorithm using the R programming language to get the optimal Eps value automatically to clustering data on peatland hotspots in Sumatera. Implementation of DBSCAN modifications in determining the value of parameter Eps is expected to be a reference to the DBSCAN algorithm, when searching the Eps value and determining the spread of hotspots that could be known clusters of the region that have the potential occurrence of peatlands fire. Content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI. Published under licence by Ltd 1

2. Results and Discussions Determination of Optimal Eps Value Using R Programming Languange This phase is done by modifying the algorithm DBSCAN in the programming language R. Modifications done by inserting a search algorithm to the algorithm DBSCAN Eps value on R found in accordance with the package fpc. DMDBSCAN pseudocode is presented in Figure 1. Algorithm 1 The pseudo code of the proposed technique DMDBSCAN to find suitable Epsi for each level of density in data set Purpose To find suitable values of Eps Input Data set of size n Output Eps for each varied density Procedure 1 for i 2 for j = 1 to n 3 d(i,j) find distance (x i, x j) 4 find minimum values of distances to nearest 3 5 end for 6 end for 7 sort distances ascending and plot to find each value 8 Eps corresponds to critical change in curves Figure 1 Pseudocode DMDBSCAN Algorithm (Elbatta 2012) The algorithm begins with the calculation euclidean distance on each pair of data latitude and longitude by dist function. Furthermore dist transform data into a matrix using as.matrix function in R. A function of matrix is used because the results dist upper triangular form that needs to be normalized into the matrix intact. Normalization is used to facilitate the search for the k nearest neighbors on each line of the distance calculation. After calculating the distance and the matrix, the initialization is done to count the number of rows in the data by using the function nrow. Further searches carried out three nearest neighbors of each matrix line spacing for sorting is done in ascending for each result the closest distance to the neighbors. Sorting the results of each line of the nearest neighbor made a plot with the x-axis and y- axis is the object and the distance k nearest neighbors. Plots that have formed in ascending calculated the difference in slope of the line to get the value of Eps (Gaonkar and Sawant 2013). A point which has a slope changes or changes in the slope of the plot will be a significant Eps optimal value (Elbatta 2012). In this study, an algorithm DMDBSCAN was adopted is to use single-level density to determine the optimal value Eps so that only produce one value Eps alone. Eps rate determination for single-level density is obtained by calculating the slope between points with equation (y2-y1) / (x2-x1) using a threshold value of 1% so that the slope has a slope of 1% difference is an optimal Eps value. Figure 2 depicts the result of a plot that has been sorted in ascending Using data hotspot 2013. Searches Eps value was done by calculating the slope of the line from any point and sought-after pair of points that have the greatest slope to locate the point. The slope of the line is located at the point of 0.1118034, a point which is the optimal value Eps (Elbatta 2012). 2

Best Eps Value = 0.1118034 Figure 2 Points sorted by distance to the 3rd nearest neighbor Clustering of Data Hotspots Using Optimal Eps Value on DBSCAN Algorithm Clustering process is successfully executed from the search results DBSCAN modification Eps optimal value automatically with the idea of k-dist plot to generate the parameter Eps = 0.1118034 and MinPts = 3. This parameter is used for clustering processes that generate 43 clusters with outliers at 54 points. Then the area has the highest density of hotspots in the area of peatland in Sumatra in 2013 contained in cluster 1 (Riau Province) that is equal to 2112 hotspots. Figure 3 shows the area in Riau province were included in cluster 1. Districts included in cluster 1 is Bengkalis, Pelalawan, Rokan Hilir, Siak, Rokan Hulu and Indragiri Hulu. Subdistrict contained in the districts included in cluster 1 can be seen in Figure 3. Cluster Analyst Figure 3 Area/Subdistrict was included in cluster 1 The analysis process performed analysis of running time to the result of determining the optimal Eps automatically on DBSCAN algorithms for clustering data using R. This analysis includes a comparison generated in this studied with research conducted Usman (2014). This research resulted in a runtime of 1.72 seconds to 32.29 seconds for the system and user time use 2 GB of RAM. The system time is the speed of the system running the job before the user enter data. User time is the time used after users enter data into the system. Therefore, this research is 3

more efficient in classifying the hotspots data because it can determine the optimal Eps value automatically. 3. Conclusion In this study, an algorithm DMDBSCAN was adopted is to use single-level density to determine the optimal value Eps so that only produce one value Eps alone. The slope of the line is located at the point of 0.1118034, a point which is the optimal value Eps. Clustering process is successfully executed from the search results of DBSCAN modification. We obtain the an Eps optimal value automatically with the idea of k-dist plot resulting in the generatiopn of parameters Eps = 0.1118034 and MinPts = 3. This parameter is used for clustering processes that generate 43 clusters with outliers at 54 points. It follows that the area which has the highest density of hotspots in Sumatra in 2013 peatland is contained in cluster 1 (Riau Province) that is equal to 2112 hotspots. The advantage of this method is that the optimal Eps value can be determined automatically. References [1] Elbatta MNT. 2012. An improvement for DBSCAN algorithm for best results in varied densities [disertasi]. Gaza (PS): Islamic University of Gaza. [2] Ester M, Kriegel HP, Sander J, Xu X. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. Di dalam: Simoudis E, editor. Proceedings of 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96); 1996 Agu 4-6. hlm 226-231. [3] Gaonkar MN, Sawant K. 2013. AutoEps DBSCAN: DBSCAN with Eps automatic for large dataset. International Journal on Advanced Computer Theory and Engineering. 2(2): 11-16. [4] Han J, Kamber M, Pei J. 2012. Data Mining: Concepts and Techniques. 3rd ed. San Francisco (US): Morgan-Kauffman. [5] Usman M. 2014. Spatial clustering berbasis densitas untuk penyebaran titik panas sebagai indikator kebakaran hutan dan lahan gambut di Sumatera [tesis]. Bogor (ID): Institut Pertanian Bogor. 4