ScienceDirect. Clustering and Classification of Software Component for Efficient Component Retrieval and Building Component Reuse Libraries

Similar documents
An Enhancement of Clustering Technique using Support Vector Machine Classifier

Survey of Algorithm for Software Components Reusability Using Clustering and Neural Network

Spectral Clustering Based Classification Algorithm for Text Classification

Available online at ScienceDirect. Procedia Computer Science 89 (2016 )

An Efficient Algorithm for Finding the Support Count of Frequent 1-Itemsets in Frequent Pattern Mining

Part I: Data Mining Foundations

A Naïve Soft Computing based Approach for Gene Expression Data Analysis

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.

Text clustering based on a divide and merge strategy

IMPLEMENTATION AND COMPARATIVE STUDY OF IMPROVED APRIORI ALGORITHM FOR ASSOCIATION PATTERN MINING

ScienceDirect. An Efficient Association Rule Based Clustering of XML Documents

Chapter 28. Outline. Definitions of Data Mining. Data Mining Concepts

A SOFT SIMILARITY MEASURE FOR K-MEANS BASED HIGH DIMENSIONAL DOCUMENT CLUSTERING

Available online at ScienceDirect. Procedia Computer Science 46 (2015 )

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data

Dimension reduction : PCA and Clustering

Optimization using Ant Colony Algorithm

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. Springer

Data Mining Concepts

Document Clustering using Feature Selection Based on Multiviewpoint and Link Similarity Measure

Impact of Term Weighting Schemes on Document Clustering A Review

Machine Learning in Biology

Data Clustering Hierarchical Clustering, Density based clustering Grid based clustering

Semi-Supervised Clustering with Partial Background Information

Density Based Clustering using Modified PSO based Neighbor Selection

Clustering of Data with Mixed Attributes based on Unified Similarity Metric

Available online at ScienceDirect. Procedia Computer Science 45 (2015 )

An Efficient Hash-based Association Rule Mining Approach for Document Clustering

Data Cleaning and Prototyping Using K-Means to Enhance Classification Accuracy

A Novel Approach for Minimum Spanning Tree Based Clustering Algorithm

An Evolutionary Algorithm for Mining Association Rules Using Boolean Approach

Available online at ScienceDirect. Procedia Computer Science 89 (2016 )

An Efficient Tree-based Fuzzy Data Mining Approach

A Partition Method for Graph Isomorphism

ScienceDirect. KNN with TF-IDF Based Framework for Text Categorization

A mining method for tracking changes in temporal association rules from an encoded database

Analyzing Outlier Detection Techniques with Hybrid Method

DENSITY BASED AND PARTITION BASED CLUSTERING OF UNCERTAIN DATA BASED ON KL-DIVERGENCE SIMILARITY MEASURE

PATTERN RECOGNITION USING NEURAL NETWORKS

Multimodal Information Spaces for Content-based Image Retrieval

Keywords Clustering, Goals of clustering, clustering techniques, clustering algorithms.

Domestic electricity consumption analysis using data mining techniques

Conceptual Review of clustering techniques in data mining field

Supervised and Unsupervised Learning (II)

To Enhance Projection Scalability of Item Transactions by Parallel and Partition Projection using Dynamic Data Set

Mining of Web Server Logs using Extended Apriori Algorithm

Transforming Quantitative Transactional Databases into Binary Tables for Association Rule Mining Using the Apriori Algorithm

Infrequent Weighted Itemset Mining Using SVM Classifier in Transaction Dataset

Improved Frequent Pattern Mining Algorithm with Indexing

A Hierarchical Document Clustering Approach with Frequent Itemsets

Performance Based Study of Association Rule Algorithms On Voter DB

Datasets Size: Effect on Clustering Results

Minoru SASAKI and Kenji KITA. Department of Information Science & Intelligent Systems. Faculty of Engineering, Tokushima University

Available online at ScienceDirect. Procedia Computer Science 89 (2016 )

ScienceDirect. Enhanced Associative Classification of XML Documents Supported by Semantic Concepts

A recommendation engine by using association rules

ISSN: [Saurkar* et al., 6(4): April, 2017] Impact Factor: 4.116

An Efficient Approach for Color Pattern Matching Using Image Mining

Available online at ScienceDirect. Procedia Computer Science 46 (2015 )

TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES

ScienceDirect. Plan Restructuring in Multi Agent Planning

Hierarchical Document Clustering

Contents. Preface to the Second Edition

A Novel Similar Temporal System Call Pattern Mining for Efficient Intrusion Detection

MODULE 1 INTRODUCTION LESSON 1

Clustering Web Documents using Hierarchical Method for Efficient Cluster Formation

Classification by Association

Available online at ScienceDirect. Procedia Computer Science 45 (2015 )

An Efficient Algorithm for finding high utility itemsets from online sell

EFFICIENT ALGORITHM FOR MINING FREQUENT ITEMSETS USING CLUSTERING TECHNIQUES

An Improved Apriori Algorithm for Association Rules

Accelerating Unique Strategy for Centroid Priming in K-Means Clustering

Competitive Intelligence and Web Mining:

Particular experience in design and implementation of a Current Research Information System in Russia: national specificity

Comparison of FP tree and Apriori Algorithm

International Journal Of Engineering And Computer Science ISSN: Volume 5 Issue 11 Nov. 2016, Page No.

A Survey On Different Text Clustering Techniques For Patent Analysis

Combining Review Text Content and Reviewer-Item Rating Matrix to Predict Review Rating

ARTICLE; BIOINFORMATICS Clustering performance comparison using K-means and expectation maximization algorithms

A Novel Categorized Search Strategy using Distributional Clustering Neenu Joseph. M 1, Sudheep Elayidom 2

On Multiple Query Optimization in Data Mining

An Approach for Privacy Preserving in Association Rule Mining Using Data Restriction

ISSN: [Sugumar * et al., 7(4): April, 2018] Impact Factor: 5.164

Parallel Popular Crime Pattern Mining in Multidimensional Databases

International Journal of Advanced Research in Computer Science and Software Engineering

A New Approach To Graph Based Object Classification On Images

Efficiently Mining Positive Correlation Rules

Visualization and text mining of patent and non-patent data

Effective Dimension Reduction Techniques for Text Documents

Noval Stream Data Mining Framework under the Background of Big Data

An Improved Technique for Frequent Itemset Mining

An Efficient Clustering Method for k-anonymization

CLASSIFICATION FOR SCALING METHODS IN DATA MINING

A Fuzzy C-means Clustering Algorithm Based on Pseudo-nearest-neighbor Intervals for Incomplete Data

ScienceDirect. An Advanced Multi Class Instance Selection based Support Vector Machine for Text Classification

Enhancing K-means Clustering Algorithm with Improved Initial Center

Introduction to Data Mining and Data Analytics

A Technical Analysis of Market Basket by using Association Rule Mining and Apriori Algorithm

Multi-label classification using rule-based classifier systems

Available online at ScienceDirect. Procedia Technology 24 (2016 )

Transcription:

Available online at www.sciencedirect.com ScienceDirect Procedia Computer Science 31 ( 2014 ) 1044 1050 2nd International Conference on Information Technology and Quantitative Management, ITQM 2014 Clustering and Classification of Software Component for Efficient Component Retrieval and Building Component Reuse Libraries Chintakindi Srinivas a, Vangipuram Radhakrishna b, *,Dr.C.V.Guru Rao c a Kakatiya Institute of Technology and Science, Warangal,INDIA b VNR Vignana Jyothi Institute of Engineering and Technology,Hyderabad,INDIA c Professor of CSE & Principal, S.R.Engineering College,Warangal,INDIA Abstract A Software Repository is a collection of library files and function codes. Programmers and Engineers design develop and build software libraries in a continuous process. Selecting suitable function code from one among many in the repository is quite challenging and cumbersome as we need to analyze semantic issues in function codes or components. Clustering and Mining Software Components for efficient reuse is the current topic of interest among researchers in Software Reuse Engineering and Information Retrieval. A relatively less research work is contributed in this field and has a good scope in the future. In this paper, the main idea is to cluster the software components and form a subset of libraries from the available repository. These clusters thus help in choosing the required component with high cohesion and low coupling quickly and efficiently. We define a similarity function and use the same for the process of clustering the software components and for estimating the cost of new project. The approach carried out is a feature vector based approach. 2014 The The Authors. Authors. Published Published by by Elsevier Elsevier B.V. B.V. Open access under CC BY-NC-ND license. Selection Selection and and/or peer-review peer-review under under responsibility responsibility of the of Organizing the organizers Committee of ITQM of 2014 ITQM 2014. Keywords : component; reuse; similarity; cluster ; mining ; repository 1. Introduction The Universality and Ubiquitous nature of the internet made the possibility to access a huge amount of virtual unlimited information in the digital text format by the humans [17]. This gave a new definition for data mining creating a new set of possibilities for mining text information. * Corresponding author. Tel.: +0-000-000-0000 ; fax: +0-000-000-0000. E-mail address: radhakrishna_v@vnrvjiet.in 1877-0509 2014 The Authors. Published by Elsevier B.V. Open access under CC BY-NC-ND license. Selection and peer-review under responsibility of the Organizing Committee of ITQM 2014. doi: 10.1016/j.procs.2014.05.358

Chintakindi Srinivas et al. / Procedia Computer Science 31 ( 2014 ) 1044 1050 1045 Consequently the text oriented derivation of data mining called as text mining has been gaining a lot of practical significance as the available data grows at higher rate than can be perceived or handled by the humans. In this context, the problem of clustering is one topic that gained more practical importance from the researchers and more specifically from the perspective of the software industry. From software engineering perspective, the requirement for clustering comes from the need for software component classification, software component clustering, analysing project behavior for predicting the project cost behavior, performing software component search and for the component retrieval from software library or software repository. Clustering is also widely used in many practical domains such as text classification, bioinformatics, medicine, image processing to name a few. The component clusters obtained from clustering process may be viewed as the highly cohesive groups with low coupling which is the desired feature. Software component clustering may be defined as the process of grouping similar set of software patterns together [1]. The input to the clustering algorithm can be a set of software entities or software patterns or software requirement documents or any set of software components. The result of the software component clustering process is a set of clusters having high cohesive nature w.r.t other software clusters. The descriptions or representations of the clusters formed may be used for the decision making for selecting a software component or software pattern of interest. The interesting feature or property of the clustering process is that all the software components within a single cluster share common properties in some sense and patterns in different clusters are dissimilar. From perspective of software engineering, all the components within same cluster have high cohesion and low coupling. Clustering is not any one specific algorithm that we can stick firm to, but it must be viewed as the general task to be solved. Clustering algorithms may unsupervised or supervised [1]. In unsupervised clustering the partitions are viewed as the unlabelled patterns or components. Supervised clustering algorithms label the patterns which can be used to classify the components for decision making. Hence the partitions obtained by clustering process may be labeled or unlabeled. A new method called Maximum Capturing is proposed for document clustering [4].Maximum Capturing includes two procedures: 1.Constructing document clusters and 2. Assigning cluster topics. The search complexity can be reduced by using the algorithm [11] where ever necessary as part of component retrieval. 2. Related Works The problem of finding frequent itemsets is first initiated from [9] which use frequent items to find association rules in large transactional databases. In [2] clustering a given set of text documents from neighbor set is proposed. In [3] the authors propose a method for discovering maximum length frequent item sets. In [7], the classification of text files or documents is done by considering Gaussian membership function and d making use of it to obtain clusters by finding word patterns. Each cluster is identified by its word pattern calculated using fuzzy based Gaussian membership functions once clusters are formed. In this paper the idea is to first obtain frequent item sets for each document using existing association rule mining algorithms either by horizontal or vertical approach. Once we find frequent itemsets in each document then we form a Boolean matrix with rows indicating documents and columns indicating unique frequent items from each document. This is followed by the computation of a ternary feature vector for each document pair, represented as a 2D array or 2D matrix by redefining the XNOR function as hybrid XNOR logic with slight modification in the function introducing high impedance variable as Z. The idea of maximum capturing is taken as the base framework for clustering [4]. The authors perform clustering using XOR similarity function [15].

1046 Chintakindi Srinivas et al. / Procedia Computer Science 31 ( 2014 ) 1044 1050 3. Proposed Work 3.1 Similarity Measure To perform clustering, we define a similarity measure which may be used to find the similarity between any pair of software components or software patterns or software documents as given in Table.A. The software components may be software requirement specification files or software product files or program codes. The proposed similarity measure Sim (E1, E2) is a function of any two attributes E1 and E2. We consider E1 and E2 as the attributes present in two different files F1 and F2 respectively. Table A. Proposed Similarity Measure E1 E2 Sim(E1,E2) Absent Absent Neglect (say Z) Absent Present 0 Present Absent 0 Present Present 1 The input for component clustering algorithm is a set of software components with properties predefined and the output is a set of highly cohesive components with low coupling feature. 3.2 Algorithm for Clustering The algorithm may be used to cluster software requirement documents or program codes and this work is an extension of [15, 17]. If the dimensionality is very high, then to reduce the dimension of the documents we may eliminate stop words and stemming words by forming a list of stop words separately and storing in a file. When clustering English documents we may use porter stemmer algorithm. In case, if the dimensionality of software documents is too large to handle, then to further reduce the dimension we may apply Singular value decomposition as given in step1 of the algorithm. Algorithm. Algorithm for Component Clustering. Input: Document set, frequent items. Output: set of clusters. of Algorithm Step1: For each document D do Step1.1 Remove stop words and stemming words from each document. Step1.2 Find unique words in each document and count of the same. Step1.3 Find reduced dimension set by applying Singular Valued Decomposition to all the files. End for Step 2: Form a word set W consisting of each word in reduced item set of each document of step1. Step 3: Form Dependency Boolean Matrix with each row and column corresponding to each Document and each word respectively For each document in document set do

Chintakindi Srinivas et al. / Procedia Computer Science 31 ( 2014 ) 1044 1050 1047 For each word in word set do If (word wk in Word set W is in document Di) Set D[Di, wk] = 1 Else Set D[Di, wk] = 0 End if End for End for Step 4: Find the Feature vector similarity matrix by evaluating similarity value for each component pair applying similarity measure defined in table 1 to obtain the matrix with feature vectors for each component or document pair. Step 5: Replace the corresponding cells of matrix by count of number of zeroes in tri state feature vector. Step 6: At each step, find the cell with maximum value and document pairs containing this value in the matrix. Group such document pairs to form clusters. Also if document pair (X,Y) is in one cluster and document pair (Y, Z) is in another cluster, form a new cluster containing (X, Y, Z) as its elements. Step 7: Repeat Step6 until no components or documents exist or we reach the stage of first minimum value leaving zero entry. Step 8: Output the set of clusters obtained. Step 9: Label the clusters by considering candidate entries. End of algorithm 4. Case Study- Clustering Software Projects to Estimate Project Cost Consider the sample parameters for the software project based on the programming language used, platform, software type, existing plan, and development model [16]. If we need to cluster or classify the set of projects based on these parameters then we can use the proposed algorithm to achieve the task. Let the parameters contain the domain as Programming Language = {C: 1, C++:2, JAVA: 3, COBOL: 4, C##:5} Platform= {Personal Computer: 1, Mobile: 2, Mainframe: 3} S/W TYPE = {Application: 1, Embed: 2} PLAN = {Yes: 1, No: 0} Development Model = {Waterfall: 1, Iterative: 2, Agile: 3, Spiral: 4, Prototype: 5} We form the matrix with rows indicating the projects and columns indicating the factors considered for the project evaluation using direct approach.

1048 Chintakindi Srinivas et al. / Procedia Computer Science 31 ( 2014 ) 1044 1050 Table 1. Matrix showing the combination of Project VS Parameters. PL Platform S/W type Plan Modern Dev. Priority Risks Release type Practice Model Project 1 C PC application no yes waterfall low less 1.0 Project 2 C++ PC application yes yes iterative low Less 1.0 Project 3 Java Mobile embedded no yes agile high mediu 1.0 m Project 4 C Mobile embedded no yes Spiral low less 1.0 Project 5 C Mobile embedded no yes Prototype low less 1.0 Project 6 Cobol mainframe application yes no Iterative medium mediu 2.0 m Project 7 Java Mobile embedded no yes Spiral low less 2.0 Project 8 C# Mobile embedded no yes agile low less 2.0 Table 2. Matrix obtained after index generation. Project Vs Attribute PL Platform S/W type Plan Modern Practice Dev.Mode Priority Risks Release type Project 1 1 1 1 2 1 1 1 1 1.0 Project 2 2 1 1 1 1 2 1 1 1.0 Project 3 3 2 2 2 1 3 2 2 1.0 Project 4 1 2 2 2 1 4 1 1 1.0 Project 5 1 2 2 2 1 5 1 1 1.0 Project 6 4 3 1 1 2 2 3 2 2.0 Project 7 3 2 2 2 1 4 1 1 2.0 Project 8 5 2 2 2 1 3 1 1 2.0 Project 9 3 2 2 2 1 4 1 1 1.0 Table 3. Similarity Matrix obtained from Table.2. P1 P2 P3 P4 P5 P6 P7 P8 P9 P1 X 6 2 6 6 1 4 4 5 P2 X X 2 4 4 4 3 3 4 P3 X X X 5 5 1 5 3 4 P4 X X X X 8 0 7 6 8 P5 X X X X X 0 6 6 7 P6 X X X X X X 1 1 1 P7 X X X X X X X 7 8 P8 X X X X X X X X 6 P9 X X X X X X X X X Stage 1: Choose the cell with maximum value from Table.3. Here (P 4,P 5 ), (P 4, P 9 ), (P 7, P 9 ) have value as 8. Group these cells to form an initial cluster containing projects (P 4, P 5, P 7, P 9 ). We thus have CLUSTER-1: {P 4, P 5, P 7, P 9 }. Stage 2: Choose the cell with next maximum value from Table.4. Here (P 1, P 2 ) are having value 6. Group

Chintakindi Srinivas et al. / Procedia Computer Science 31 ( 2014 ) 1044 1050 1049 these cells to form a cluster containing projects (P 1, P 2 ). Table 4. Reduced Similarity Matrix obtained from Stage 1. P1 P2 P3 P6 P8 P1 X 6 2 1 4 P2 X X 2 4 3 P3 X X X 1 3 P6 X X X X 1 P8 X X X X X We thus have CLUSTER-2: {P 1, P 2 } Stage 3: Choose the cell with next maximum value from Table.5. Here (P 3, P 8 ) are having value 3. Group these cells to form a cluster containing projects (P 3, P 8 ). Table 5. Reduced Similarity Matrix obtained from Stage 2. P3 P6 P8 P3 X 1 3 P6 X X 1 P8 X X X So the cluster formed is CLUSTER-3: {P 3, P 8 } Stage 4 : The final cluster formed is CLUSTER-4: {P 6 } Table 6. Reduced Similarity Matrix obtained from Stage 3. P6 P6 X If we have a new project with parameters specified, we can test to which cluster the new project is similar and then can estimate the nature of the project to perform cost estimation. So the clusters formed are CLUSTER-1: {P 4, P 5, P 7, P 9 }. CLUSTER-2: {P 1, P 2 } CLUSTER-3: {P 3,P 8 } CLUSTER-4 : {P 6 }

1050 Chintakindi Srinivas et al. / Procedia Computer Science 31 ( 2014 ) 1044 1050 5. Conclusion In this paper, we define a new similarity function to compute the similarity between any two software components or software requirement documents. An algorithm to cluster a set of software components is outlined which uses the proposed similarity function to find the similarity among any two software entities. The input to algorithm is a similarity matrix and the output is the set of cohesive clusters. The algorithm is applied to set of projects with predefined parameters and clustering process is carried out from which we can estimate the cost of the project from the properties of previously carried out projects. In future, the approach can be extended to classify the components using classifiers by applying fuzzy logic. The search complexity can be reduced by using the algorithm [11] where ever necessary as part of component retrieval. References 1. V.Susheela Devi, M. Narasimha Murthy. Text Book on Pattern Recognition. An Introduction. University Press. 2. Congnan Luo, Yanjun Li, M. Chung. Text document clustering based on neighbours, Data & Knowledge Engineering, 2009; 68:1271 1288. 3. Tianming Hu, Sam Yuan Sung, Hui Xiong, Qian Fu. Discovery of maximum length frequent itemsets, Information Sciences, 2008; 178: 69 87. 4 Wen Zhanga,, Taketoshi Yoshida, Xijin Tang, Qing Wang. Text clustering using frequent itemsets, Knowledge-Based Systems, 2010; 23: 379 388 5. Wen Zhanga, Taketoshi Yoshida, Xijin Tang. A comparative study of TF*IDF, LSI and multi-words for text classification. Expert Systems with Applications, 2011; 38: 2758 2765. 6. Vincent Labatut and Hocine Cherifi. Accuracy Measures for the Comparison of Classifiers, ICIT 2011 The 5th International Conference on Information Technology. 7. Jung-Yi Jiang et.al A Fuzzy Self-Constructing Feature Clustering Algorithm for Text Classification. IEEE Transactions on Knowledge and Data Engineering, 2011; 23(3). 8. Melita Hajdinjak, Andrej Bauer. Similarity Measures for Relational Databases, Informatica, 2009; 33:143 149. 9. R.Agrawal, T. Imielinski, A. Swami. Mining association rules between sets of items in very large databases, Proceedings of the ACM SIGMOD Conference on Management of data, 1993: 207 216. 10. F. Beil, M. Ester, X.W. Xu, Frequent term-based text clustering, in: Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002: 436 442. 11. Radhakrishna.V, C.Srinivas, C.V.GuruRao. High Performance Pattern Search algorithm using three sliding windows, International Journal of Computer Engineering and Technology, 2012; 3: 543-552. 12. V.Susheela Devi, M. Narasimha Murthy. Text Book on Pattern Recognition. An Introduction. University Press. 13. Salim kebir, Abdelhak-djamel seriai, Sylvain Chardigny. Comparing and Combining Genetic and Cluster Algorithms for Software Component Identification, in the Proceedings of the ACM Fifth International Conference on Computer Science and Software Engineering, 2012: 1-8. 14. Ronaldo.C.Veras, Silvio R.L.Meira, Adriano L.I. Oliveira, Bruno J.M.Melo. Comparitive study of clustering techniques for the organisation of software repositories, in the proceedings of 19th IEEE International Conference on Tools with Artificial Intelligence. 15. Vangipuram Radhakrishna, C.Srinivas, C.V.GuruRao. Document Clustering Using Hybrid XOR Similarity Function for Efficient Software Component Reuse. Procedia Computer Science, 2013; (17): 121-128. 16. Divya Kashyap, A. K. Misra. Software development cost estimation using similarity difference between software attributes. ISDOC '13: Proceedings of the 2013 International Conference on Information Systems and Design of Communication. 17. Chintakindi Srinivas, Vangipuram Radhakrishna, C.V. Guru Rao. Clustering Software Components for Program Restructuring and Component Reuse Using Hybrid XNOR Similarity Function. Procedia Technology, 2014 ; (12): 246-54.