Clustering Algorithm for Files and Data in Cloud (MMR Approach)

Similar documents
An Algorithm for Generating New Mandelbrot and Julia Sets

On Demand Web Services with Quality of Service

Digital Image Processing for Camera Application in Mobile Devices Using Artificial Neural Networks

Selection of Web Services using Service Agent: An optimized way for the selection of Non-functional requirements

Document Image Binarization Using Post Processing Method

A Novel Method to Solve Assignment Problem in Fuzzy Environment

SQL Based Paperless Examination System

Two-stage Interval Time Minimization Transportation Problem with Capacity Constraints

Computer Engineering and Intelligent Systems ISSN (Paper) ISSN (Online) Vol.5, No.4, 2014

Numerical solution of Fuzzy Hybrid Differential Equation by Third order Runge Kutta Nystrom Method

Harvesting Image Databases from The Web

Location Based Spatial Query Processing In Wireless System

Dynamic Instruction Scheduling For Microprocessors Having Out Of Order Execution

Bandwidth Recycling using Variable Bit Rate

Fuzzy k-c-means Clustering Algorithm for Medical Image. Segmentation

Differential Evolution Biogeography Based Optimization for Linear Phase Fir Low Pass Filter Design

A Heuristic Based Multi-Objective Approach for Network Reconfiguration of Distribution Systems

Mobile Ad hoc Networks Dangling issues of optimal path. strategy

Survey on Wireless Intelligent Video Surveillance System Using Moving Object Recognition Technology

Modeling of Piecewise functions via Microsoft Mathematics: Toward a computerized approach for fixed point theorem

Design and Simulation of Wireless Local Area Network for Administrative Office using OPNET Network Simulator: A Practical Approach

The Fast Fourier Transform Algorithm and Its Application in Digital Image Processing

A Novel Technique for Controlling CNC Systems

A Deadlock Free Routing Algorithm for Torus Network

A Novel Approach for Imputation of Missing Value Analysis using Canopy K-means Clustering

Control Theory and Informatics ISSN (print) ISSN (online) Vol 2, No.1, 2012

Numerical Flow Simulation using Star CCM+

MSGI: MySQL Graphical Interface

Multi-Modal Data Fusion: A Description

A File System Level Snapshot In Ext4

Video Calling Over Wi-Fi Network using Android Phones

Computer Engineering and Intelligent Systems ISSN (Paper) ISSN (Online) Vol 3, No.2, 2012 Cyber Forensics in Cloud Computing

On Mobile Cloud Computing in a Mobile Learning System

Dynamic Load Balancing By Scheduling In Computational Grid System

USING SOFT COMPUTING TECHNIQUES TO INTEGRATE MULTIPLE KINDS OF ATTRIBUTES IN DATA MINING

An FPGA based Efficient Fruit Recognition System Using Minimum Distance Classifier

A Cultivated Differential Evolution Algorithm using modified Mutation and Selection Strategy

Information and Knowledge Management ISSN (Paper) ISSN X (Online) Vol 2, No.2, 2012

Rainfall-runoff modelling of a watershed

Algorithm for Classification

3D- Discrete Cosine Transform For Image Compression

Recursive Visual Secret Sharing Scheme using Fingerprint. Authentication

A Survey of Model Used for Web User s Browsing Behavior Prediction

Root cause detection of call drops using feedforward neural network

Efficient Retrieval of Web Services Using Prioritization and Clustering

Fundamental Concepts and Models

Gn-Dtd: Innovative Way for Normalizing XML Document

Using Categorical Attributes for Clustering

Offering an Expert Electronic Roll Call and Teacher Assessment System Based on Mobile Phones for Higher Education

Audio Compression Using DCT and DWT Techniques

Performance study of Association Rule Mining Algorithms for Dyeing Processing System

Secure Transactions using Wireless Networks

Datasets Size: Effect on Clustering Results

Query Optimization to Improve Performance of the Code Execution

Mobile Cloud Computing: Implications and Challenges

Utilizing Divisible Load Scheduling Theorem in Round Robin Algorithm for Load Balancing In Cloud Environment

Application of Light Weight Directory Access Protocol to Information Services Delivery in Nigerian Tertiary Institutions Libraries

Design of A Mobile Phone Data Backup System

A Probabilistic Data Encryption scheme (PDES)

Comparative Analysis of QoS-Aware Routing Protocols for Wireless Sensor Networks

Clustering Documents in Large Text Corpora

An Optimized Congestion Control and Error Management System for OCCEM

Voice Based Smart Internet Surfing for Blind using JADE Agent and HTML5 Developing Environment

Comparative Study of Clustering Algorithms using R

Construction and Application of Cloud Data Center in University

Multi-Modal Data Fusion. Sarah Coppock

Computer Engineering and Intelligent Systems ISSN (Paper) ISSN (Online) Vol 3, No.6, 2012

Collaborative Rough Clustering

Semi-Supervised Clustering with Partial Background Information

Adaptive Balanced Clustering For Wireless Sensor Network Energy Optimization

Soft Computing and Artificial Intelligence Techniques for Intrusion Detection System

Face Location - A Novel Approach to Post the User global Location

International Journal of Computer Engineering and Applications, Volume VIII, Issue III, Part I, December 14

An Anchor Vector Based Similarity Function For Boosting Transactions Clustering. Sam Y. Sung. Robert Ho ABSTRACT

Modelling of a Sequential Low-level Language Program Using Petri Nets

Measuring Round Trip Time and File Download Time of FTP Servers

Clustering Mixed Data Set Using Modified MARDL Technique

An Evolutionary Approach to Optimizing Cloud Services

A Review-Botnet Detection and Suppression in Clouds

Website Vulnerability to Session Fixation Attacks

INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering

A Heuristic Approach for Web log mining using Bayesian. Networks

Chapter 4. Fundamental Concepts and Models

Proposed Multi-Mode Home Node-B Air Interface Protocol Stack Architecture

CS570: Introduction to Data Mining

A New Technique to Fingerprint Recognition Based on Partial Window

Introduction to Cloud Computing. [thoughtsoncloud.com] 1

Data Mining. 3.2 Decision Tree Classifier. Fall Instructor: Dr. Masoud Yaghini. Chapter 5: Decision Tree Classifier

Clustering Method for Mixed Categorical and Numerical Data

CS570: Introduction to Data Mining

Udaipur, Rajasthan, India. University, Udaipur, Rajasthan, India

Multi-class Multi-label Classification and Detection of Lumbar Intervertebral Disc Degeneration MR Images using Decision Tree Classifiers

KEYWORDS: Clustering, RFPCM Algorithm, Ranking Method, Query Redirection Method.

Cloud Computing on Smartphone

Efficient Algorithms/Techniques on Discrete Wavelet Transformation for Video Compression: A Review

EFFICIENT ATTRIBUTE REDUCTION ALGORITHM

Problems faced in Communicate set up of Coordinator with GUI and Dispatcher in NCTUns network simulator

Assistant Professor, School of Computer Applications,Career Point University,Kota, Rajasthan, India Id

S.NO Papers published by the faculty members in National/ International Journals FACTOR

ROUGH SETS THEORY AND UNCERTAINTY INTO INFORMATION SYSTEM

Transcription:

Clustering Algorithm for Files and Data in Cloud (MMR Approach) Shobhit Tiwari * Sourav Khandelwal School of Computing Science and Engineering, Vellore Institute of Technology, Vellore, India Abstract The gradual advancement of technologies like cloud computing, there is enough necessity to improve computing techniques. Speed and security is a problem even in this environment. In order to achieve security, the files and data in any cloud environment need to be anonymized. Clustering has been used for long as a technique to achieve this. The cloud environment is supposed to have uncertainty and the data may be heterogeneous. So, a robust algorithm is necessary in this direction. In this paper we propose an algorithm, which we call as AMMR (Advanced Min-Min Roughness) algorithm by extending the MMR algorithm developed by Tripathy. This algorithm has the characteristics like being stable, handling uncertainty and categorical data. Keywords : Cloud Computing, Database, MMR, Rough Set. 1. Introduction Clustering is the unsupervised classification of patterns (observations, data items, or feature vectors) into groups (clusters). Data clustering is a method in which we make cluster of objects that are somehow similar in characteristics. The criterion for checking the similarity is implementation dependent. Clustering is often confused with classification, but there is some difference between the two. In classification the objects are assigned to predefined classes, whereas in clustering the classes are also to be defined. We can use clustering in different techniques in data mining. Unsupervised classification, data segmentation are some of its applications. We can easily segment the large heterogeneous data sets into smaller homogeneous subsets which are further analyzed after separately modeling. Biomedicine, research and application of radar scanning, development and manufacturing are some of the active areas where clustering techniques have been constantly being used successfully. Cloud computing is a potential area where optimization needs to be done with respect to data clustering. There are or may be not enough sufficient algorithm that can cluster the data. Moreover these algorithms are based more on numerical data, and are not designed to handle uncertainty in the clustering process. Dealing with the cloud where there are different types of files and data that have multi- valued attribute data and file clustering becomes a important issue. There are algorithms related to application of fuzzy set in clustering categorical data have been proposed by Huang ( Halkidi, Batistakis & Vazirgiannis,2001) and Kim et al. ( He, Xu & Deng, 2004). However, these algorithms require multiple runs to establish the stability needed to obtain a satisfactory value for one parameter used to control the membership fuzziness. So, a clustering algorithm that is robust and can, to an extent handle to an extent uncertainty in data clustering. This kind of algorithm was given by (Ganti, Gehrke & Ramakrishnan, 1999), which uses rough set theory. It manages impreciseness based on rough set theory. It also handles uncertainty to a great extent. Given a large data set it can provide with stable results given a input. In the following sections we present the new algorithm which can be considered the advanced version of MMR, analyze its superiority and using advanced clustering techniques cluster data sets in a cloud. 2. Cloud Computing Cloud computing is a delivery platform which is flexible, cost-effective and proven. It basically provides business or consumer IT services over the Internet. The term Cloud is used as a metaphor for the Internet. Regardless of user location or device, Cloud resources can be rapidly deployed and easily scaled, with all processes, applications and services provisioned on demand,. It is the delivery of computing as a service rather than a product, whereby shared resources, software, and information are provided to computers and other devices as a metered service over a network. Most cloud computing infrastructures consist of services delivered through shared data-centers and appearing as a single point of access for consumers' computing needs. It can be viewed as a combination (have characteristics) of: 1. Client-server model: Distributed application that has a server and client used for communication between a server and client. 2. Autonomic computing: Self-management for Computer System. 3. Mainframe computer: for bulk data processing. 4. Peer to peer: in contrast to client server, both participants being at the same time both suppliers and consumers of resource without the need of central coordination. 5. Utility computing: Pay and use kind of tradition. It is device and location independent. Its characteristic of multi tenancy allows sharing of resources and costs 84

across a large pool of users. It is reliable and scalable. Maintenance is also easier as cloud computing applications need not to be installed in each user s computer. 2.1 Types of cloud 1. Hybrid Cloud: It is a combination of two or more private, community or public cloud. It further provides opportunity for further deployment. 2. Private cloud: A cloud which is developed and deployed solely for a single organization. It can be either managed the organization itself or third party. 3. Intercloud : Extension of internet which is a interconnected global clouds of clouds. 4. Community cloud: It is a cloud that is shared between different organizations that can be either managed internally or by the third party. There are three types of Cloud models: 1. SAAS: Software as a Service, which provides software as and when required. 2. PAAS: Platform as a Service provides platform, middleware, databases, development tooling etc when and as required. 3. IAAS: Infrastructure as a Service provides support of Servers, Networking, and Storage as and when required by the user. 3. MMR Algorithm The MMR algorithm is used for clustering data tables with objects having multiple attributes. Here, this algorithm finds importance in PAAS (platform as a service) which provides database management and services. In general, in a cloud, we may have many objects with heterogeneous attributes say our data table may contain pictures, videos, audio etc. Now these objects may have different attributes for example a picture may have attribute of ( date, size, format, location, dimensions ).We may need to cluster the data tables according to our need for this we must first identify the pictures (say by format identification for e.g. file formats with.jpg,.jpeg,.gif are more likely to be pictures) and then extract the attributes in a data table, and cluster according to the algorithm discussed. These concepts can be further extended according to our needs by extracting the multiple attributes of the required files and then clustering the data tables. 4. Advanced MMR Nomenclature used in Advanced MMR is shown in Table 1. 4.1 Algorithm Select clustering type Switch: Case i) file type //clustering based on type of file attribute; Loop{array[attributes]: = Getfile(attributes)}; Create a separate database of attributes collected through functions; //for example if (jpg, jpeg, png, gif file extensions etc are identified and placed under a column named pictures ) //Similarly for videos etc; Call procedure AMMR(U,k) //clustering algorithm Break; Case ii)data type; Loop{array[attributes]:=Getdata(attributes)}; Create a separate database of attributes collected through functions; Call procedure Ammr(U,k) //clustering algorithm Break; Procedure AMMR(U, k) //after getting data table of attributes; Begin Set current number of cluster CNC = 1 Set NodeParent = U Loop1: If CNC < k and CNC 1 then ParentNode = ProcParentNode (CNC) End if // Clustering the nodeparent For eacha i A (i = 1 to n, where n is the number of attributes in A) Determine m Ind(a i) [x ] (m = 1 to number of objects) 85

For each a j A(j = 1 to n, where n is the number of attributes in A, j i) Rough aj(a) i Next Min-Roughness (a i ) = Min (Rough aj (a i )) Next Set Min Min-Roughness =Min (Min-Roughness (a i )), i = 1,...,n Determine splitting attribute a i corresponding to the Min Min-Roughness Do binary split on the splitting attribute ai CNC = the number of leaf nodes Go to Loop 1 End ProcParentNode (CNC) Begin Set i = 1 Do until i < CNC Size (i) = Count (Set of Elements in Cluster i) i = i + 1 Loop Determine Max (Size (i)) Return (Set of Elements in cluster i) corresponding to Max (Size (i)) End Calculate 5. Handling of different types of data This AMMR algorithm is designed to cluster the data and files available in a cloud. It first extracts the attributes and other information from the network and creates a separate database for example if (jpg, jpeg, png, gif file extensions etc are identified and placed under a column named pictures ) Similarly for videos etc. Then it uses the MMR logic to cluster the data sets. For clustering of numerical data we have defined classes. Here we have taken minimum of n of the number of equivalence classes of all attributes containing categorical data, and does not divide the number of objects then its ceiling or floor is selected, whichever leads to merging of less number of objects. So, we merge the elements which are nearest possible. This iterative step ends when the above condition is satisfied. The merged set is termed as an element and all its elements are included in same class. The AMMR algorithm emphasizes on finding the minimum of roughness values for every equivalence class of each attribute and choosing the attribute possessing the equivalence class with least minimum as splitting attribute rather than choosing the attribute which has the least roughness with respect to any attribute. 5.1 Clustering Process After getting the attributes in the database and after splitting the objects into two parts, we find the average distance between every two tuples in each cluster. For finding the distance we shall use the method of Hamming distance, which gives the difference between two objects, taking the equality or otherwise of the respective attributes into consideration. If the respective attribute values are equal we add zero and add 1 otherwise. Comparing the average distances in the clusters, the record having the least distance is selected. If clusters with same average distances are found then we choose the one which has more number of objects. In case of a tie a random selection is made. 6. Conclusions No or very few algorithms have described approach to cluster the data, files and network in a cloud. Here we have introduced a criterion to find a distance between any two data objects by generalizing Hamming distance. The object is created by selecting attributes from files and networks. Our Other achievement is in choosing cluster for re-clustering and handling of heterogeneous data and files. Future enhancements of this algorithm can be made in the fields of selection of splitting attribute by introducing fuzzy properties. This will lead to development of rough- fuzzy concepts in clustering. References Dempster, A., Laird, N., Rubin, D. (1977), Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society 39 (1),pp 1 38. Ganti, V., Gehrke, J. Ramakrishnan, R. (1999), CACTUS clustering categorical data using summaries, in: Fifth 86

ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 73 83. Gibson, D., Kleinberg, J., Raghavan, P. (2000), Clustering categorical data: an approach based on dynamical systems, The Very Large Data Bases Journal 8 (3 4), pp 222 236. Guha, S. Rastogi, R. Shim, K. (2000), ROCK: a robust clustering algorithm for categorical attributes, Information Systems 25 (5), pp 345 366. Halkidi, M., Batistakis, Y., Vazirgiannis, M. (2001), On clustering validation techniques, Journal of Intelligent Information Systems 17 (2 3), pp 107 145. Han, E., Karypis, G., Kumar, V., Mobasher, B. (1997), Clustering based on association rule hypergraphs, in: Workshop on Research Issues on Data Mining and Knowledge Discovery, pp. 9 13. He, Z., Xu, X., Deng, S. (2004): A link clustering based approach for clustering categorical data, Proceedings of the WAIM Conference, <http://xxx.sf.nchc.org.tw/ftp/cs/papers/0412/0412019.pdf>. He, Z., Xu, X., Deng, S. (2002), Squeezer: an efficient algorithm for clustering categorical data, Journal of Computer Science & Technology 17(5), pp 611 624. Huang, Z. (1998), Extensions to the k-means algorithm for clustering large data sets with categorical values, Data Mining and Knowledge Discovery 2 (3), pp 283 304. Kim, D., Lee, K., Lee, D. (2004), Fuzzy clustering of categorical data using fuzzy Centroid Pattern Krishnapuram, R., Frigui, H., Nasraoui, O. (1995), Fuzzy and possibilistic shell clustering algorithms and their application to boundary detection and surface approximation, IEEE Transactions on Fuzzy Systems 3 (1), pp 29 60. Krishnapuram, R., Keller, J. (1993), A possibilistic approach to clustering, IEEE Transactions on Fuzzy Systems 1 (2), pp 98 110. Pawlak, Z. (1991), Rough Sets Theoretical Aspects of Reasoning about Data, Kluwer Academic Publishers, Boston. Pawlak, Z. (1997), Rough set approach to knowledge-based decision support, European Journal of Operational Research 99 (1), pp 48 57. Ralambondrainy, H. (1995), A conceptual version of the K-means algorithm, Pattern Recognition Letters 16 (11), pp 1147 1157. Ruspini, E. (1969), A new approach to clustering, Information Control 15 (1), pp 22 32. Zhang, Y., Fu, A., Cai, C., Heng, P. (2000), Clustering categorical data, in: Proceedings of the 16th International Conference on Data Engineering, pp. 305 324. Sourav Khandelwal is currently pursuing his B.Tech in Computer Science and Engineering from Vellore Institute of Technology. His research interests include Analytical Problem solving, Cloud Computing. He also has a passion for coding. Shobhit Tiwari is currently pursuing his B.Tech in Computer Science and Engineering from Vellore Institute of Technology. His research interests include Wireless Networks, E-learning and Algorithm Design. Table 1. Nomenclature used in Advanced MMR Algorithm Symbol Meaning U Universe or the set of all objects (x1, x2,.. ) X X subset of the set of all objects, (X U) x i object belonging to the subset of the set of all objects, xi X A the set of all attributes (features or variables) a i attribute belonging to the set of all attributes, ai A. V(a i ) set of values of attribute ai (or called domain of ai) non-empty subset of A(B A) B X B lower approximation of X with respect to B upper approximation of X with respect to B X B Rai( x ): roughness with respect to {ai} [X i ] Ind(B Equivalence class of x i in relation Ind(B), also known as elementary set in B. 87

The IISTE is a pioneer in the Open-Access hosting service and academic event management. The aim of the firm is Accelerating Global Knowledge Sharing. More information about the firm can be found on the homepage: http:// CALL FOR JOURNAL PAPERS There are more than 30 peer-reviewed academic journals hosted under the hosting platform. Prospective authors of journals can find the submission instruction on the following page: http:///journals/ All the journals articles are available online to the readers all over the world without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. Paper version of the journals is also available upon request of readers and authors. MORE RESOURCES Book publication information: http:///book/ Academic conference: http:///conference/upcoming-conferences-call-for-paper/ IISTE Knowledge Sharing Partners EBSCO, Index Copernicus, Ulrich's Periodicals Directory, JournalTOCS, PKP Open Archives Harvester, Bielefeld Academic Search Engine, Elektronische Zeitschriftenbibliothek EZB, Open J-Gate, OCLC WorldCat, Universe Digtial Library, NewJour, Google Scholar