Identifying Idioms of Source Code Identifier in Java Context

Similar documents
The Earley Parser

V. Thulasinath M.Tech, CSE Department, JNTU College of Engineering Anantapur, Andhra Pradesh, India

LING/C SC/PSYC 438/538. Lecture 3 Sandiway Fong

LAB 3: Text processing + Apache OpenNLP

The CKY algorithm part 1: Recognition

Block-Matching Twitter Data for Traffic Event Location

Compilers and Interpreters

INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET) CONTEXT SENSITIVE TEXT SUMMARIZATION USING HIERARCHICAL CLUSTERING ALGORITHM

Fast and Effective System for Name Entity Recognition on Big Data

Technique For Clustering Uncertain Data Based On Probability Distribution Similarity

A NATUWAL LANGUAGE INTERFACE USING A WORLD MODEL

UNIVERSITY OF EDINBURGH COLLEGE OF SCIENCE AND ENGINEERING SCHOOL OF INFORMATICS INFR08008 INFORMATICS 2A: PROCESSING FORMAL AND NATURAL LANGUAGES

COMP 181 Compilers. Administrative. Last time. Prelude. Compilation strategy. Translation strategy. Lecture 2 Overview

NLP Chain. Giuseppe Castellucci Web Mining & Retrieval a.a. 2013/2014

Privacy and Security in Online Social Networks Department of Computer Science and Engineering Indian Institute of Technology, Madras

The CKY algorithm part 1: Recognition

Restricted Use Case Modeling Approach

Graph-based Entity Linking using Shortest Path

Christoph Treude. Bimodal Software Documentation

Exam III March 17, 2010

A Brief Incomplete Introduction to NLTK

Compilers. Lecture 2 Overview. (original slides by Sam

A study on improvement of evaluation method on web accessibility automatic evaluation tool's <IMG> alternative texts based on OCR

CHAPTER 5 SEARCH ENGINE USING SEMANTIC CONCEPTS

What do Compilers Produce?

Ortolang Tools : MarsaTag

Automatic Text Summarization System Using Extraction Based Technique

Text Mining for Software Engineering

CS Lecture 2. The Front End. Lecture 2 Lexical Analysis

Transforming Requirements into MDA from User Stories to CIM

Parsing. Parsing. Bottom Up Parsing. Bottom Up Parsing. Bottom Up Parsing. Bottom Up Parsing

Principles of Programming Languages COMP251: Syntax and Grammars

CSE450 Translation of Programming Languages. Lecture 4: Syntax Analysis

Conceptual and Logical Design

Using NLP and context for improved search result in specialized search engines

Compiler Design 1. Introduction to Programming Language Design and to Compilation

Army Research Laboratory

Admin PARSING. Backoff models: absolute discounting. Backoff models: absolute discounting 3/4/11. What is (xy)?

Comment-based Keyword Programming

Introduction to Programming Languages and Compilers. CS164 11:00-12:30 TT 10 Evans. UPRM ICOM 4029 (Adapted from: Prof. Necula UCB CS 164)

Apache UIMA and Mayo ctakes

Occam & C++ Translator

Pioneering Compiler Design

ACCESSING DATABASE USING NLP

A Hybrid Unsupervised Web Data Extraction using Trinity and NLP

Application of recursion. Grammars and Parsing. Recursive grammar. Grammars

Ontology based Model and Procedure Creation for Topic Analysis in Chinese Language

A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK

1DL321: Kompilatorteknik I (Compiler Design 1) Introduction to Programming Language Design and to Compilation

CSE 417 Dynamic Programming (pt 6) Parsing Algorithms

Smart Image Search System Using Personalized Semantic Search Method

NLP - Based Expert System for Database Design and Development

Computer Components. Software{ User Programs. Operating System. Hardware

Classifying Twitter Data in Multiple Classes Based On Sentiment Class Labels

A Personal Information Retrieval System in a Web Environment

Enabling Semantic Search in Large Open Source Communities

Defining Program Syntax. Chapter Two Modern Programming Languages, 2nd ed. 1

1DL321: Kompilatorteknik I (Compiler Design 1) Introduction to Programming Language Design and to Compilation

String Vector based KNN for Text Categorization

Introduction to Software Engineering: Analysis

1DL321: Kompilatorteknik I (Compiler Design 1)

Refresher on Dependency Syntax and the Nivre Algorithm

Basic Concepts. Computer Science. Programming history Algorithms Pseudo code. Computer - Science Andrew Case 2

Lexical Analysis. Chapter 2

NLP Final Project Fall 2015, Due Friday, December 18

Documentation and analysis of an. endangered language: aspects of. the grammar of Griko

Document Structure Analysis in Associative Patent Retrieval

Formats of Translated Programs

Objectives. Coding Standards. Why coding standards? Elements of Java Style. Understand motivation for coding standards

Parsing tree matching based question answering

CS 224N Assignment 2 Writeup

Lecture 14: Annotation

Precise Medication Extraction using Agile Text Mining

Encoding Words into String Vectors for Word Categorization

A Comprehensive Analysis of using Semantic Information in Text Categorization

Semantic sentence structure search engine

The Rise of the (Modelling) Bots: Towards Assisted Modelling via Social Networks

Generating Animation from Natural Language Texts and Semantic Analysis for Motion Search and Scheduling

TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES

Working of the Compilers

Structural Text Features. Structural Features

Question Answering Using XML-Tagged Documents

Design of Good Bibimbap Restaurant Recommendation System Using TCA based on BigData

NLP in practice, an example: Semantic Role Labeling

COMPILER DESIGN. For COMPUTER SCIENCE

Inter-Annotator Agreement for a German Newspaper Corpus

Converting the Corpus Query Language to the Natural Language

Algorithms for NLP. Chart Parsing. Reading: James Allen, Natural Language Understanding. Section 3.4, pp

Lab II - Product Specification Outline. CS 411W Lab II. Prototype Product Specification For CLASH. Professor Janet Brunelle Professor Hill Price

School of Computing and Information Systems The University of Melbourne COMP90042 WEB SEARCH AND TEXT ANALYSIS (Semester 1, 2017)

CS101 Introduction to Programming Languages and Compilers

Syntax and Grammars 1 / 21

Modeling Crisis Management System With the Restricted Use Case Modeling Approach

Web Product Ranking Using Opinion Mining

Compiler Construction

Lexical Analysis. Lecture 2-4

Design and Implementation of HTML5 based SVM for Integrating Runtime of Smart Devices and Web Environments

Review on Text Mining

SURVEY PAPER ON WEB PAGE CONTENT VISUALIZATION

GRADES LANGUAGE! Live, Grades Correlated to the Oklahoma College- and Career-Ready English Language Arts Standards

Transcription:

, pp.174-178 http://dx.doi.org/10.14257/astl.2013 Identifying Idioms of Source Code Identifier in Java Context Suntae Kim 1, Rhan Jung 1* 1 Department of Computer Engineering, Kangwon National University, Kangwon-Province, South Korea, {stkim, jungran}@kangwon.ac.kr Abstract. This paper presents an approach to identifying a domain word POS(Part of Speech) and idiom code identifiers written in Java programming language. To detect them, we extracted common identifiers from 14 Java API documents, and applied diverse filters. In addition, NLP (Natural Language Parser) has been used to detect common mistakes in the Java API documents. As a result, this paper identified 80 idioms from the documents. Keywords: Idioms, Source Code Identifiers, Inconsistency 1 Introduction A source code identifier (e.g., method or class identifiers) is one of the key means for comprehending a software system for software maintainer. Specially, it becomes more important when the system does not have enough documents describing the internal structure of the system. In naming code identifiers, developers tend to constantly use a single POS(Part-Of-Speech) of a word, and common idioms that the only people possessing the domain knowledge of the application area can understand regardless. Identifying the domain terms and idioms is valuable not only for writing an understandable source code, but also for reviewing the validity of the source code identifiers. There has been some work to improve source code readability (e.g., see [1][5]). Most studies are based on software metric, measuring extent of readability using indicators such as line length of a method, number of comments and keywords. Although the approach is helpful to characterize quality of the source code, software developers do not have any concrete hints or guidelines while naming source code elements. In this paper, we present an approach to identifying a domain word POS and idiom code identifiers written in Java programming language. We define an idiom as a word or a combination of words that violates grammar rules of Java Naming Convention throughout diverse Java projects. In addition, the domain word POS indicates a specific word POS that only used in a specific domain. To identify idioms and a domain word POS, we use the Java API(Application Programming Interface) ISSN: 2287-1233 ASTL Copyright 2013 SERSC

Document as a input, and grammatically analyze all identifiers of classes, methods and attributes introduced in the API documents. Then, domain words that are usually discovered in diverse projects are detected, and their POS are mined through several filters. Also, we check violation of an identifier against the grammar rules, so that idioms are detected if the violation is commonly discovered throughout several java projects. In identifying the domain word POS and idioms, we have utilized popular 14 Java API documents including Java SE(Standard Edition) and various Apache projects, which can be considered as core libraries for Java based software systems. 2 Identifying Idioms of Source Code Identifiers This section introduces how we automatically identify a domain word POS and idioms. Fig. 1 shows the overview of our approach. It first starts with analyzing 18 java or library API documents. Then, it tokenizes all identifiers of classes, methods and attributes and parses each identifiers with a NLP(Natural Language Processing) parser[6] so that the results are stored in the intermediary storage. Based on the results, domain words and its POS are identified, and idiom identifiers are also extracted via passing them through several filters. Java/Library API Documents Building Code Dictionary Parsing API Docs Models - Class - Attribute - Method Extracting Idioms Detecting Domain Word POS Idioms Domain Word POS Fig. 1. An Overview of Our Approach Step1. Parsing API Documents Domain words and Idioms are extracted fromjava or Common LibraryAPI(Application Programming Interface) documents. Obtaining code identifiers from API documents is easier and more reliable approach rather than examining source code, because code identifiers from API are officially suggestible to Copyright 2013 SERSC 175

users of the API. We collected class, method, and attribute identifiers from 14 API documents as shows in Table 1. Due to the space limit, we just listed the representative libraries. Library Version Description Java Development Kit 1.7.0 Java standard development kit Apache Ant 1.8.3 Java library and command-line tool for building java project Apache POI 3.9 APIs for manipulating various file formats based upon Microsoft Table 1. Java/Library API Documents for the Code Dictionary Then, an NLP parser assigns POS to each word composing of an identifier. To parse an identifier using an NLP parser, an identifier is tokenized first according to rules from the camel case[3]. Then, a blank is inserted between tokenized terms, and a period is inserted at the end of the last term especially for a method identifier to make a complete sentence. This is because a method identifier constitutes a verb phrase according to the Java Naming Convention[2], which is a complete sentence without a subject. The converted identifier is transferred into a NLP parser to analyze POS of each term. As an example, a method identifier getwordset() is converted into get word set.. Then, the parser analyzes POS of the sentence, resulting in [(VP (VB get) (NP(NN word)(nn set)(..)))], where VP, VB, NP and NN denote a verb phrase, a verb, a noun phrase and a noun respectively. The results are stored in each model as an intermediate storage. Step2. Detecting Domain Word POS Domain words and its POS cover computer domain specific words. The word file, as an example, is frequently used as a noun or a verb in a natural language, while it is generally used only as a noun indicating the way of storing data in the second storage in a computer domain. Detecting domain word and its POS consists of two sub-steps. The first step is that words discovered less than 100 times are filtered out among all candidate words. This is because domain words are used frequently rather than other non-domain words in source code identifiers. As the second step, if a POS in POSs of a word occupies 90% of total usage, the POS is considered as a dominated POS of the word and the word and its POS is identified as a domain word. This is because a domain word are generally used as a single dominated POS. Fig. 2 shows the steps with sample words. All words extracted from the API documents are 14,642 words, each word includes number of POS usages (see left part in the figure). In the figure, nn, vb, adj and adv denotes a noun, a verb, an adjective and an adverb respectively. After passing the first 100 Uses Filter, words having a dominant POS pass the second 90% Filter. With the two filters, we have obtained 183 domain words and its pos including thread and file. The domain words and its POS are used to increase precision of the NLP parser and inconsistent identifiers. 176 Copyright 2013 SERSC

thread(nn-147, vb-1, adj-1) thread(nn-147, vb-1, adj-1) 98% file(nn-689, vb-11, adj-9, adv-2) thread(nn) file(nn-689, vb-11, adj-9, adv-2) 96% file(nn) name(nn-273, vb-436) name(nn-273, vb-436) 61% close(nn-49, vb-21, adv-8) 14,642 words 218 words 183 words 100 Uses Filter 90% Filter Fig. 2. Detecting Domain Words and Its POS Step3. Detecting Idiom Identifiers An idiom identifier is a single or compound identifier that commonly violates grammatical rules of Java naming conventions. Due to its common usages, they are generally acceptable regardless of violation of the grammatical rules. The method keyset() and length() are a noun phrase or a noun respectively. However, the Java naming convention guides that the method should be a verb or a verb phrase. Building idiom identifiers are motivated by observation of the code. We discovered that a large number of identifiers does not observe the grammatical rules, even Java Development Kit[4] contains many identifiers violating the rules, and developers does not even think that is violation. Thus, we created idiom identifiers to reflect it to increase precision of inconsistent identifier detection. Building idiom identifiers consists of two steps as shows in Fig. 3. It starts with results from parsing API documents, which contains 98,320 identifiers. They passes through the Frequency Filter which filters out identifiers that are not discovered throughout at least three API documents, and less than three usages for a class and an attribute, ten usages for a method. The identifiers that passed through the filter are 800. The second step filter is Phrase-POS Filter detecting identifiers that violate grammatical rules. As a result, we extracted 80 idiom identifiers including abs(), available(), elements(), intvalue(), length() and values() as representative method idiom identifiers, ALL, debug and verbose as representative attribute idiom identifiers. Class (9,474) Method (73,236) Attribute (15,610) Frequency Filter Class (0) Method (623) Attribute (177) Phrase-POS Filter Class(0) Method(67) Attribute(13) 98,320 identifiers 800 identifiers 80 idiom identifiers Fig. 3. Detecting Idiom Identifiers Copyright 2013 SERSC 177

3 Conclusion and Future Work In this paper, we have presented an approach to identifying domain words and its POS, and idioms frequently discovered throughout diverse Java or Java based API documents. Based on the the domain words and idioms, we strongly believe that software developers can adopt the domain terms while developing their application of the application area, and software maintainers can comprehend the system in an easier manner. This research is also valuable as one of the steps to increase the quality of the source code identifiers. As a future work, we are planning to apply this approach to detect inconsistent identifiers of the Java based software systems. References 1. R. Buse and W. Weimer. A Metric for Software Readiability. In Proceedings of International Symposium on Software Testing and Analysis(ISSTA), pages 121 130, Seattle, WA, 2008. 2. Code Conventions for the Java Programming Language: Why Have Code Conventions, Sunmicro Systems,1999. http://www.oracle.com/technetwork/java/index-135089.html. 3. F.Deibenbock andm. Pizka. Concise and Consistent Naming. In Proceedings of International Workshop on Program Comprehension(IWPC), pages 261 282, St. Louis, MO, USA, 2005. 4. Java Standard Edition. http://www.oracle.com/technetwork/java/javase/overview/index.html. 5. D. Posnett, A. Hindle, and P. Devanbu. A Simpler Model of Software Readiability. In Proceedings of International Confernce on Mining Software Repository(MSR), pages 73 82, Honolulu, Hawaii, 2011. 6. The Stanford Parser: A statistical parser, Home page,2012. http://nlp.stanford.edu/software/lex-parser.shtml. 178 Copyright 2013 SERSC