A Method for Improving the Accuracy of Bug Mining by Replacing Stemming with Lemmatization


Volume 119 No. 10 2018, 729-735
ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version)
url: http://www.ijpam.eu

1 C. Arjunchandra, 2 C.J. Sandhya and 3 G. Deepa
1 Department of Computer Science and IT, Amrita School of Arts and Sciences, Kochi, Amrita Vishwa Vidyapeetham, India. arjunchandrakunnummakara@gmail.com
2 Department of Computer Science and IT, Amrita School of Arts and Sciences, Kochi, Amrita Vishwa Vidyapeetham, India. sandhyacj107@gmail.com
3 Department of Computer Science and IT, Amrita School of Arts and Sciences, Kochi, Amrita Vishwa Vidyapeetham, India. deepsgopi@gmail.com

Abstract
Bug mining is a major research area. Applying text mining techniques to bug data helps developers considerably in reducing the recurrence of bugs. Identifying frequently occurring bugs that match one another is one way to address software defects. In this paper, we propose a method for mining frequent patterns that uses FP-Growth together with lemmatization instead of a stemming algorithm. Lemmatization is a technique that takes parts of speech into account to map a group of word forms to a single word, based on their dictionary form. The main advantage of lemmatization over stemming is that it does not simply crop the inflections from words but removes them carefully, relying on a knowledge base with lexical interpretations. Thus lemmatization offers better precision than stemming.

Key Words: Bug mining, tokenization, lemmatization, stop words, FP-Growth.

1. Introduction
Our main motive in software engineering is to improve the productivity, readability, reusability, efficiency, and quality of software. Different data mining techniques, such as preprocessing, association, classification, and clustering, are applied to bug repositories to obtain effective frequent patterns. In this paper, we propose a method for finding frequent patterns in bug data. Bugs are reported by the analysts, testers, and users of the software; these bug reports are stored in bug repositories and are managed and tracked by different tools. Preprocessing is a critical step in text classification, giving rise to a relatively canonical representation of textual descriptions [3]. A typical preprocessing phase usually consists of the following steps: tokenization, stop word removal and stemming [3]. Our goal is to propose a text mining method that applies tokenization and stop-word removal, followed by lemmatization, and then uses the FP-Growth algorithm.

2. Methods Used in Proposed Work
We use the following methods for bug mining.

Text Mining
Text mining means extracting useful content from textual data. Some text mining techniques are:

Tokenization
Tokenization is the splitting of sentences into smaller units such as words. There are different tokenization methods, such as the n-gram tokenizer, alphabetic tokenizer, and word tokenizer [1].
Example:
Input: she is so beautiful
Output: she, is, so, beautiful

Stemming
Stemming is the process of trimming derived words to their root word [2]. Stemming algorithms include the table-lookup approach, n-gram stemmer, successor variety, and affix removal stemmer.

Lemmatization
Lemmatization is the process of grouping a set of word forms into a single word based on their dictionary form.

Stop-words Removal
We eliminate stop-words because they consume additional memory and are not informative. Words such as this, there, and were are examples of stop-words.

FP-Growth
FP-Growth is used for generating frequent patterns from the bug data set.
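To make the FP-Growth step more concrete, the following is a minimal sketch in Python, not the authors' implementation. It assumes each bug report has already been reduced to a set of terms; the names FPNode, build_tree, fp_growth, and the sample bug_reports data are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


class FPNode:
    """One node of the FP-tree: an item, its count, and links to parent/children."""

    def __init__(self, item, parent):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_tree(transactions, min_support):
    """Build an FP-tree from (items, count) pairs.

    Returns the header table (item -> nodes holding it) and item supports.
    """
    support = defaultdict(int)
    for items, count in transactions:
        for item in set(items):
            support[item] += count
    frequent = {i: s for i, s in support.items() if s >= min_support}

    root = FPNode(None, None)
    header = defaultdict(list)
    for items, count in transactions:
        # Keep only frequent items, most frequent first (ties broken by name).
        ordered = sorted((i for i in set(items) if i in frequent),
                         key=lambda i: (-frequent[i], i))
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += count
            node = child
    return header, frequent


def fp_growth(transactions, min_support, suffix=frozenset()):
    """Yield (pattern, support) for every frequent pattern."""
    header, frequent = build_tree(transactions, min_support)
    for item, nodes in header.items():
        pattern = suffix | {item}
        yield pattern, frequent[item]
        # Conditional pattern base: the prefix paths that lead to `item`.
        conditional = []
        for node in nodes:
            path, parent = [], node.parent
            while parent is not None and parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                conditional.append((path, node.count))
        yield from fp_growth(conditional, min_support, pattern)


# Hypothetical, already preprocessed bug reports (one set of terms per report).
bug_reports = [
    {"crash", "login", "timeout"},
    {"crash", "login"},
    {"crash", "timeout"},
    {"login", "timeout"},
    {"crash", "login", "timeout"},
]
patterns = dict(fp_growth([(list(r), 1) for r in bug_reports], min_support=3))
for pattern, count in sorted(patterns.items(), key=lambda p: (len(p[0]), sorted(p[0]))):
    print(sorted(pattern), count)
```

With a minimum support of 3, the sketch reports the three single terms and the three pairs as frequent, while the full triple (support 2) is dropped. Unlike Apriori, it never generates candidate itemsets; it only walks prefix paths of the tree.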


3. Related Work
Divyavarma K., Remya M., and Deepa G. [1] proposed a method for mining frequent bug patterns by applying text mining techniques and the FP-Growth algorithm. Their work overcomes issues in bug mining by replacing the Apriori algorithm with the more effective FP-Growth algorithm. It uses tokenization, stop-word removal, and the FP-Growth algorithm to find frequent patterns. The problem with this work is that it omits stemming altogether, because stemming fails when technical words are taken into consideration; without any such normalization step, however, we cannot say that the result is accurate.

4. Proposed Work
We propose a method for finding frequent patterns that uses lemmatization instead of stemming; the proposed work is shown in Figure 1. Tokenization is the process of breaking sentences down into multiple tokens, which can be digits or words. For bug reports, the word tokenizer is the best tokenization method [1]. Stop words do not have much significance in bug reports, so removing them does not change the meaning. For example, in the sentence "There are beautiful flowers growing in the garden", the words there, are, in, and the are stop words; removing them leaves {beautiful, flowers, growing, garden}. Stemming, in contrast, can distort technical words. For example, in "operation in Linux", the word operation is converted to oper after applying a stemming technique, so the meaning of the word is lost. Therefore, after applying tokenization and stop-word removal, we propose lemmatization as the next pre-processing step instead of stemming. We selected lemmatization to replace stemming because it finds the base form of technical words without altering them.

Figure 1: Proposed Work
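The pre-processing steps described above can be illustrated with a small, self-contained Python sketch. It is not the NLTK-based setup used in the paper: the stop-word list, the suffix list of the naive stemmer, and the miniature lemma lexicon are all hypothetical stand-ins, chosen only to reproduce the two examples from the text.

```python
import re

# Hypothetical stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"there", "are", "in", "the", "is", "this", "were", "a", "an", "of"}

# Hypothetical miniature lexicon standing in for a dictionary such as WordNet:
# it maps an inflected form to its dictionary form (lemma).
LEMMA_LEXICON = {
    "flowers": "flower",
    "growing": "grow",
    "operations": "operation",
}

# Suffixes that a naive affix-removal stemmer might strip, tried in this order.
SUFFIXES = ("ation", "ing", "es", "s")


def tokenize(text):
    """Word tokenizer: lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def naive_stem(token):
    """Unconditionally chop the first matching suffix (keeps at least 3 letters)."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[:-len(suffix)]
    return token


def lemmatize(token):
    """Replace a token only if the lexicon knows its dictionary form."""
    return LEMMA_LEXICON.get(token, token)


sentence_tokens = remove_stop_words(
    tokenize("There are beautiful flowers growing in the garden"))
print(sentence_tokens)  # ['beautiful', 'flowers', 'growing', 'garden']

bug_tokens = remove_stop_words(tokenize("operation in Linux"))
print([naive_stem(t) for t in bug_tokens])  # ['oper', 'linux']
print([lemmatize(t) for t in bug_tokens])   # ['operation', 'linux']
```

The contrast in the last two lines is the point of the proposal: the stemmer chops "operation" down to "oper", while the dictionary-based step leaves a term it has no inflected entry for untouched.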


Lemmatization means removing inflectional endings and reducing a word to a single base form. For example, after lemmatization the phrase "Operation in Linux" remains {Operation in Linux}. We use the NLTK WordNet Lemmatizer to check technical terms; the text analysis results are shown in Figure 2 and Figure 3. WordNet is a large lexical database in which nouns, verbs, adjectives, and adverbs are grouped into sets of synonyms known as synsets. These synsets are interconnected by lexical relationships. WordNet can be regarded as a dictionary or a thesaurus, since it groups words according to their meanings. As a result, lemmatization does not change the technical terms (Figure 2).

Figure 2: NLTK Wordnet Lemmatizer

Figure 3: NLTK Stemmer

Stemming algorithms work by cutting off the end of a word, and in some cases the beginning, in search of the root. This indiscriminate cutting succeeds in some instances but not always, which is why the approach has limitations. Here, stemming changes the technical terms and also the meaning of the words (Figure 3).

To generate bug patterns, we use the FP-Growth algorithm. A bug repository may contain a large number of bugs, and FP-Growth is an effective way to generate frequent bug patterns from them.

5. Conclusion
Stemming reduces each word to a root form by chopping off its ending, whereas lemmatization removes inflections by consulting a dictionary. Our solution makes effective use of text mining by incorporating lemmatization through the NLTK WordNet lemmatizer, which yields more precise results through conditional chopping than stemming, which chops unconditionally. Thus we propose that bug mining can be more effective when used with lemmatization for mining frequent patterns. This work can be further extended by incorporating stemming algorithms to deal with more bug related issues.

References
[1] Divyavarma K., Remya M., Deepa G., An Enhanced Bug Mining for Identifying Frequent Bug Pattern using Word Tokenizer and FP-Growth, Advances in Intelligent Systems and Computing 515 (2017).
[2] Kiran Kumar B., Jayadev Gyani, Narasimha G., Mining Frequent Patterns from Bug Repositories, IJARCSSE (2014).
[3] Zhou Y., Tong Y., Gu R., Gall H., Combining Text Mining and Data Mining for Bug Report Classification, ICSME (2014).
[4] Neelima V., Annapurna N., Alekhya V., Vidyavathi B., Bug Detection through Text Data Mining, IJARCSSE (2013).
[5] Rashmi S., Nitin S., An Improved Association Rule Mining With Fp Tree Using Positive And Negative Integration, JGRCS (2012).
[6] Leon Wu et al., BUGMINER: Software Reliability Analysis via Data Mining of Bug Reports, SEKE, Knowledge Systems Institute Graduate School (2011).
[7] Jaweria Kanwal, Onaiza Maqbool, Managing Open Bug Repositories through Bug Report Prioritization Using SVMs, ICOSST (2010).
[8] Drkanak S., Rajpoot D.S., A Way to Understand Various Patterns of Data Mining Techniques for Selected Domains, IJCSIS (2009).
[9] Philipp Schugerl, Juergen Rilling, Philippe Charland, Mining Bug Repositories: A Quality Assessment, CIMCA (2008).
[10] Hahsler M., Chelluboina S., Visualizing association rules: Introduction to the R-extension package arulesViz, R project module (2011).
[11] Lamkanfi A., Demeyer S., Giger E., Goethals B., Predicting the severity of a reported bug, 7th IEEE Working Conference on Mining Software Repositories (2010).

