PERFORMANCE ANALYSIS OF DATA MINING TECHNIQUES FOR REAL TIME APPLICATIONS

Similar documents
Smart Door Security Control System Using Raspberry Pi

Introduction to Machine Learning CMU-10701

University of Florida CISE department Gator Engineering. Clustering Part 2

IoT Based Traffic Signalling System

CSE 5243 INTRO. TO DATA MINING

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Slides From Lecture Notes for Chapter 8. Introduction to Data Mining

Unsupervised Learning : Clustering

Mine Blood Donors Information through Improved K- Means Clustering Bondu Venkateswarlu 1 and Prof G.S.V.Prasad Raju 2

A Low Cost Internet of Things Network for Contamination Detection in Drinking Water Systems Using Raspberry Pi

Hybrid Fuzzy C-Means Clustering Technique for Gene Expression Data

Introduction to Mobile Robotics

CLUSTERING. CSE 634 Data Mining Prof. Anita Wasilewska TEAM 16

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler

Opinion Mining and Sentimental Analysis of TripAdvisor.in for Hotel Reviews

Smart Home Automation and Live Streaming By Using Raspberry PI

Analysis of Extended Performance for clustering of Satellite Images Using Bigdata Platform Spark

Hard clustering. Each object is assigned to one and only one cluster. Hierarchical clustering is usually hard. Soft (fuzzy) clustering

PESIT- Bangalore South Campus Hosur Road (1km Before Electronic city) Bangalore

Unsupervised Learning

Nearest Clustering Algorithm for Satellite Image Classification in Remote Sensing Applications

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data

Implementation of Real Time Local Search Particle Filter Based Tracking Algorithms on BeagleBoard-xM

Cluster Analysis. Ying Shen, SSE, Tongji University

Iteration Reduction K Means Clustering Algorithm

II. PROPOSED SYSTEM AND IMPLEMENTATION

Smart Home Automation Using Web-Server

9.1. K-means Clustering

CSE 5243 INTRO. TO DATA MINING

Video and Image Processing for Finding Paint Defects using BeagleBone Black

Lecture Notes for Chapter 7. Introduction to Data Mining, 2 nd Edition. by Tan, Steinbach, Karpatne, Kumar

Collaborative Rough Clustering

A SURVEY ON CLUSTERING ALGORITHMS Ms. Kirti M. Patil 1 and Dr. Jagdish W. Bakal 2

Texture Image Segmentation using FCM

Performance Analysis of Data Mining Classification Techniques

Machine Learning. Unsupervised Learning. Manfred Huber

Novel Approaches of Image Segmentation for Water Bodies Extraction

New Approach for K-mean and K-medoids Algorithm

A Raspberry Pi Based System for ECG Monitoring and Visualization

Implementation of ATM security using IOT

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

A Comparative study of Clustering Algorithms using MapReduce in Hadoop

Dynamic Clustering of Data with Modified K-Means Algorithm

K-Means. Oct Youn-Hee Han

A Naïve Soft Computing based Approach for Gene Expression Data Analysis

Accelerating Unique Strategy for Centroid Priming in K-Means Clustering

Machine Learning for Signal Processing Clustering. Bhiksha Raj Class Oct 2016

Using Center Representation and Variance Effect on K-NN Classification

10701 Machine Learning. Clustering

A Smart E-Health Care System with Health Sensors and Cloud

Electronics Single Board Computers

MIS2502: Data Analytics Clustering and Segmentation. Jing Gong

ENHANCED DBSCAN ALGORITHM

Data Mining Cluster Analysis: Advanced Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining, 2 nd Edition

Clustering and Visualisation of Data

CHAPTER-6 WEB USAGE MINING USING CLUSTERING

Keywords Clustering, Goals of clustering, clustering techniques, clustering algorithms.

Semi-Supervised Clustering with Partial Background Information

Comparative Analysis of K means Clustering Sequentially And Parallely

Regression Based Cluster Formation for Enhancement of Lifetime of WSN

Unsupervised Learning: Clustering

Unsupervised Learning. Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi

6. Dicretization methods 6.1 The purpose of discretization

Understanding Clustering Supervising the unsupervised

A FUZZY LOGIC BASED METHOD FOR EDGE DETECTION

Topic 1 Classification Alternatives

Smart Organization. Vivek Ghule Department of Computer Engineering Vishwakarma Institute of Information Technology Pune, India

Parallel K-Means Clustering with Triangle Inequality

Remote Surveillance and Smart Stock Monitoring Robot. using Internet of Things

Data Informatics. Seon Ho Kim, Ph.D.

Major Components of the Internet of Things Systems

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION

Machine Learning using MapReduce

Based on Raymond J. Mooney s slides

ANALYSIS COMPUTER SCIENCE Discovery Science, Volume 9, Number 20, April 3, Comparative Study of Classification Algorithms Using Data Mining

ECLT 5810 Clustering

NORMALIZATION INDEXING BASED ENHANCED GROUPING K-MEAN ALGORITHM

Machine Learning (BSMC-GA 4439) Wenke Liu

Comparing and Selecting Appropriate Measuring Parameters for K-means Clustering Technique

DATA MINING LECTURE 7. Hierarchical Clustering, DBSCAN The EM Algorithm

Pattern Clustering with Similarity Measures

Online Social Networks and Media. Community detection

Intelligent Image and Graphics Processing

Comparative Study Of Different Data Mining Techniques : A Review

[Tiwari*, 5(2): February, 2016] ISSN: (I2OR), Publication Impact Factor: 3.785

A Genetic Clustering Technique Using a New Line Symmetry Based Distance Measure

Association Rule Mining and Clustering

CS Introduction to Data Mining Instructor: Abdullah Mueen

Supervised vs. Unsupervised Learning

Enhancing K-means Clustering Algorithm with Improved Initial Center

Raspberry Pi System For Detecting Machine Status

Colorado School of Mines. Computer Vision. Professor William Hoff Dept of Electrical Engineering &Computer Science.

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS

Mounica B, Aditya Srivastava, Md. Faisal Alam

Overlapping Clustering: A Review

SYDE Winter 2011 Introduction to Pattern Recognition. Clustering

Clustering CS 550: Machine Learning

[7.3, EA], [9.1, CMB]

Object Segmentation in Color Images Using Enhanced Level Set Segmentation by Soft Fuzzy C Means Clustering

k-means, k-means++ Barna Saha March 8, 2016

Using Machine Learning to Optimize Storage Systems

Transcription:

PERFORMANCE ANALYSIS OF DATA MINING TECHNIQUES FOR REAL TIME APPLICATIONS Pradeep Nagendra Hegde 1, Ayush Arunachalam 2, Madhusudan H C 2, Ngabo Muzusangabo Joseph 1, Shilpa Ankalaki 3, Dr. Jharna Majumdar 4, 5 1 BE Final Year Student, Dept. of Computer Science & Engg. 2 BE Final Year Student, Dept. of Electronics & Communications Engg 3 Asst Prof. Dept. of Mtech Computer Science & Engg. 4 Dean R&D Prof. and Head Dept. of MTech Computer Science & Engg 5 Head, Center for Robotics Research Nitte Meenakshi Institute of Technology, Yelahanka, Bangalore Abstract Data Mining plays a pivotal role in genomics, bio informatics, agriculture, fraud detection and financial banking. Extensive research has been done in this domain in the last three decades and numerous algorithms have been developed. This paper aims to make a comparative analysis of a few popular algorithms like Fuzzy C and K- for different data in real time and of development boards like Raspberry Pi 2,. Conventionally, the value of K in K- is not known initially and is given as input. This paper aims to avoid manual input by the user by using Batchelor Wilkins algorithm and Elbow Method. The performance of the data mining algorithms is determined using a set of Quality Metric (QM) parameters. The results show the most suitable algorithm for clustering for different databases. Timing analysis is also performed to determine the best suited development board for real time applications. The algorithms have been implemented in C on and in Python on Raspberry Pi 2. Index Terms: Batchelor Wilkins,, Data Mining, Elbow Method, Fuzzy C, K-, Raspberry Pi 2, QM Parameters, I. INTRODUCTION is the task of splitting up data points into a number of groups such that data points belonging to a group are closely correlated than those in the other groups. The objective is to segregate groups with similar characteristics and assign them into clusters. This project involves classifying the objects based on their features, derived from various methods. The features are subjected to Data Mining techniques like Fuzzy C and K- for performance comparison and classification. The efficiency of Data Mining techniques may vary with the input data set. Hence, choosing the most suitable Data Mining technique for a particular data set or purpose is vital to perform clustering. This paper fulfills this prerequisite by presenting a comparative analysis on the aforementioned Data Mining techniques. The performances of these algorithms are compared using various quality metric parameters, namely sum of squared error, intra cluster distance, and inter cluster distance. This paper also contributes to the optimization of K- clustering by determining the value of K automatically using Batchelor Wilkins algorithm. This alleviates the burden of providing a random value for K. II. LITERATURE SURVEY Data mining deals with extracting information from a data set and transforming it into an understandable structure for further use. is the part of data mining which deals 66

with batching objects that are similar in some sense to one another. In the last five decades, extensive research has been carried out and many algorithms have been developed. A. FUZZY C MEANS TECHNIQUE Fuzzy C [2] clustering is a soft clustering algorithm. It allocates membership to each data point corresponding to each cluster center on the basis of distance between the data point and the cluster center. Lesser the separation between the data point and the particular cluster center, more is the membership of the data point with that particular cluster center. The membership value is calculated between the data point and the cluster center. It is contrived as Where, describes membership of i th data to j th cluster center and dij describes Euclidean distance between i th data and j th cluster center. maximization (EM) algorithm for mixtures of Gaussian distributions via an iterative refinement approach employed by both algorithms. Both employ cluster centers for data modelling. K- clustering tends to determine clusters based on their compactness or proximity among data points, while the EM mechanism allows clusters to have variegated structures and outlines. describes Euclidean distance between i th data and j th cluster center. Based on this the clusters will be formed. C. BATCHELOR WILKINS TECHNIQUE Batchelor Wilkins [3] algorithm is a clustering algorithm used when number of cluster centers is unknown. It is a simple heuristic procedure for clustering and is sometimes referred to as maximum distance algorithm. It differs from Fuzzy C means algorithm in the fact that the value of k i.e. the number of cluster centers isn t predefined. We determine the number of cluster centers automatically using Batchelor Wilkins algorithm. Select c cluster centers and calculate the fuzzy membership using the above mentioned formula. After all the operations compute the fuzzy center using the earlier results. Where, describes j th cluster center and xi will be the data point values. B. K-MEANS TECHNIQUE K- [2] is a hard clustering algorithm, originally from signal processing, that is widely used for cluster analysis in Data Mining. K- clustering focuses on segregating n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as the template of the cluster. The problem is computationally difficult (NP-hard); however, there are high performance heuristic algorithms that are commonly utilized to ensure convergence to a local optimum occurs rapidly. These are usually similar to the expectation Centroid = Where, f(xi) is the value of the data point xi i.e. a feature or property and m is the number of such data points. D. ELBOW METHOD Elbow method is a technique used to determine the best value of k in K- and the best initial value of clusters in Fuzzy C algorithms. It is a technique used to interpret and validate within-cluster consistency. 67

E. QUALITY METRIC PARAMETERS i. Sum of Square Error (SSE) SSE [5] computes the average of the squares of the errors i.e. the difference between the approximator and the approximated values. SSE = Where, xi is the data point. ii. Intra Cluster Distance (ICD) Intra Cluster Distance [4] will compute the value between the data points and the supreme value of the data point within that cluster. Ethernet, Audio, and Video. It is used for real time image and video processing. IV. PROPOSED WORK Data Mining algorithms like Fuzzy C, K- are implemented to cluster the input data points. Batchelor Wilkins algorithm is used to determine the initial number of clusters i.e. K automatically. Elbow Method is used to find the best value of k among those specified by the user. Both Fuzzy C and K- algorithms are implemented on development boards like Raspberry Pi 2 and. Batchelor Wilkins sample ICD = iii. Inter Cluster Distance (InCD) Inter Cluster Distance [4] is the quality metric value which is calculated between the supreme data point of one cluster and the minimum data point of the adjacent cluster. InCD = III. DEVELOPMENT BOARDS This paper deals with implementation of Data Mining techniques on hardware for real time applications. Development boards like Raspberry Pi 2 and are used in order to obtain results in real time. Raspberry Pi 2 Raspberry Pi 2 is a low cost minicomputer which supports Linux and Windows 10 OS. It is used to interface sensors, actuators. It has 1 GB of RAM, 700 MHz ARM processor, 26 GPIO pins, Wi-FI module, and provisions for USB, Ethernet, HDMI, Audio, and Video. It is extensively used for IoT, security, and gaming applications. is a powerful single board computer which supports Linux and Android OS. It is a quad core processor which provides powerful prototyping. It has 1 GB of RAM, ARM Cortex imx 6+ Arduino Due processors, 72 GPIO pins, Wi-FI module, and provisions for USB, HDMI, Fuzzy C flowchart 68

K- flowchart Fig 3. QM Analysis of Leaf Images Data Algorith m Quality Metrics Parameters SSE ICD InCD Fuzzy C 50,397.87109 8,763.743164 42,285.863281 K- 12,405.314453 6,405.314453 258,815.375 Fig 4. Timing Analysis (in ms) of Agriculture Data Device Algorithm Raspberry Pi 2 V. EXPERIMENTAL RESULTS The above discussed Data Mining algorithms are compared with their respective results and QM analysis for different data. Before analyzing the results of the different Data Mining algorithms, it s necessary to study the trends of the quality metrics which represents the performance of the Data Mining algorithms. Fig.1 shows the trends of the quality metrics parameters. Fig 1. Trends of Quality Metrics Parameters Quality Metrics SSE ICD InCD Characteristics Higher Value: Poor Lower Value: Good Higher Value: Poor Lower Value: Good Higher Value: Good Lower Value: Poor Fuzzy C 3.57 2.031 K- 0.457 0.2320 Fig 5. Timing Analysis (in ms) of Leaf Images Data Device Algorith m Raspberry Pi 2 Fuzzy C 2.497 2.051 K- 0.391 0.221 Fig 6. of Leaf Images Data using K- Fig 2. QM Analysis of Agriculture Data Algorith Quality Metrics Parameters m SSE ICD InCD Fuzzy C 1,387,57 0.816 596,285. 875 71,804,41 0 K- 596,285. 9 689,432 87,891,23 4 69

Fig 7. of Agriculture Data using K- [2] Pang-Ning Tan, Michael Steinbach, Vipin Kumar: Introduction to Data Mining, Addison-Wesley, 2005 [3] Greg Hamerly, Charles Elkan {ghamerly, elkan} (2003). Learning the k in k-means. [4] Kirkland, Oliver and De La Iglesia, Beatriz (2013) Experimental evaluation of cluster quality measures. [5] Ujjwal Maulik, Member, IEEE, and Sanghamitra Bandyopadhyay, Member, IEEE (2002). Performance Evaluation of Some Algorithms and Validity Indices. VI. CONCLUSION This paper presents a comparative analysis of the performance of different data mining algorithms namely, Fuzzy C and K-. The performance has been compared using various QM parameters namely, sum of squared error, intra cluster distance and inter cluster distance. This paper also presents a comparative analysis of the performance of Raspberry Pi 2 and development boards by performing timing analysis. From the above results, this paper concludes that K- algorithm provides the best clustering results for agriculture data and leaf images data and that is the better development board. The paper also presents a novel procedure for determining the initial number of clusters, K for performing K- clustering which was computed with the help of Batchelor Wilkins algorithm. ACKNOWLEDGEMENTS The authors express their sincere gratitude to Prof. N. R Shetty, Advisor and Dr. H C Nagaraj, Principal, Nitte Meenakshi Institute of Technology for giving constant encouragement and support to carry out research at NMIT. The authors extend their thanks to Vision Group on Science and Technology (VGST), Government of Karnataka to acknowledge our research and providing financial support to setup the infrastructure required to carry out the research. REFERENCES [1] Gonzalez, R. C., & Woods, R. E. (2002). Digital image processing (2nd Ed.).NJ: Prentice Hall. 70