Natural Language Processing Is No Free Lunch

Stefan Wagner, University of Stuttgart, Stuttgart, Germany

Introduction
o Impressive progress in NLP: operating systems ship with personal assistants like Siri or Cortana.
o A brief check on how, and how not, to apply NLP in software analytics.
o Case study: NLP applied to the documentation of software systems.
o Most documentation, although structured and versatile (JavaDoc and Doxygen), focuses on the level of functions/methods and classes, while the component level is often missing. Why not use the former to generate the latter?

NL Data in Software Projects
o There is a lot of NL in a software project:
  o Textual documentation for the user or of the software architecture.
  o Commit messages and issue descriptions.
  o Comments in the source code.
o Nowadays a wide range of algorithms allows processing large NL datasets.

How to Apply NLP: 2 Techniques
o Part-of-speech tagging: returns the grammatical use of each word (verb, noun, or determiner).
o Topic modeling: extracts the most probable topics in an NL dataset.
o Stemming: the process of removing morphological and inflexional endings from words. "Read" and "reading" → both correspond to the same word.
o Lemmatization: employs a dictionary and morphological analysis to return the base form of a word (called the lemma). "Better" has the lemma "good" (stemming would miss it).
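The difference between stemming and lemmatization can be illustrated with a small sketch. This is a toy example, not a real stemmer or lemmatizer: the suffix list, the dictionary, and the function names `toy_stem` and `toy_lemmatize` are assumptions made up for illustration, and a real analysis would rely on an established NLP library.

```python
# Toy illustration only: the suffix list and dictionary below are
# hypothetical and far too small for real use.
SUFFIXES = ("ing", "ed", "s")

def toy_stem(word):
    """Strip one known suffix, keeping at least three characters."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w

# A lemmatizer looks words up instead of cutting them, so it can map
# irregular forms to their base form.
LEMMA_DICT = {"better": "good", "reading": "read", "read": "read"}

def toy_lemmatize(word):
    """Return the dictionary lemma, or the lowercased word if unknown."""
    w = word.lower()
    return LEMMA_DICT.get(w, w)

if __name__ == "__main__":
    for word in ("read", "reading", "better"):
        print(word, toy_stem(word), toy_lemmatize(word))
```

Suffix stripping maps "reading" to "read" but leaves "better" untouched, while the dictionary lookup returns "good". That is the case the slide refers to when it says stemming would miss it.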

NLP Is No Magic
o You have to try alternative algorithms and tune them during the analysis.
o The results depend strongly on the quality of the analyzed NL data:
  o The analysis of an official Java library, which is well documented, will work well, while
  o the analysis of open source code (which has fewer comments) will provide less useful results.
o Don't discard manual analysis:
  o Humans can make connections based on their own knowledge and experience, and they can formulate results that are easily accessible to other humans.
  o You can use manual feedback combined with NLP techniques.
  o Systematic manual analysis: it requires a simple coding of the textual data, where a tag or code is attached to the piece of NL analyzed (a word, a sentence, or a whole paragraph).

Can Clone Detection Support Quality Assessments of Requirements Specifications?
Elmar Juergens, Florian Deissenboeck, Martin Feilkas, Benjamin Hummel, Bernhard Schaetz, Stefan Wagner
Technische Universität München, Garching b. München, Germany
Christoph Domann, Jonathan Streit
itestra GmbH, Garching b. München, Germany

Introduction
o Software requirements specifications (SRS) are the keystone of most software projects.
  o They influence the product's quality and the effort spent on development.
  o They are the key (and often only) communication artifact between customer and contractor.
o SRS are mostly written in NL → few techniques for automated quality assessment.
o It is possible to use clone detection to tackle redundancy.
o Clone detection is commonly applied to find duplications in source code (cloning), which can:
  o Increase a project's size and the effort required for size-related activities.
  o Lead to errors caused by inconsistent changes.

Research Problem
o 4 objectives:
  o Do real-world requirements specifications contain duplicated information?
  o What kind of information is duplicated?
  o What consequences does the duplication of information have on the different software development activities?
  o Can existing clone detection approaches be applied in practice to identify duplication in SRS automatically?
o 28 SRS analyzed → a total of 8,667 pages.

Terminology
o Requirements specification (RS): a specification for a particular software product, program, or set of programs that performs certain functions in a specific environment.
o An RS can be interpreted as a single sequence of words.
o A normalized RS is obtained by transforming its words so that similar words are grouped together.
o A specification clone is a substring of the normalized specification with a certain minimal length, appearing at least twice.
o A clone group contains all clones of a specification that have the same content.
o Clones of a relevant clone group must convey similar information, and this information must refer to the system described → system interaction steps.
o Clone coverage denotes the part of a specification that is covered by cloning. It represents the probability that an arbitrarily chosen specification sentence is cloned at least once.
o Number of clone groups and clones denotes how many different logical specification fragments have been copied and how often they occur.
o Blow-up describes how much larger the specification is compared to a hypothetical specification version that contains no clones.

Methodology: Study Definition
o 4 research questions:
  o RQ1. How much cloning do real-world requirements specifications contain?
  o RQ2. What kind of information is cloned in requirements specifications?
  o RQ3. What consequences does cloning in requirements specifications have?
  o RQ4. Can cloning in requirements specifications be detected accurately using existing clone detectors?
o Content analysis of the study objects (specifications from industrial projects), performed with clone detection and manually.

Methodology: Study Design
1. First, RS are assigned randomly to pairs of researchers for further analysis.
2. Clone detection is performed on all documents of a specification.
3. The researcher pairs perform clone detection tailoring for each specification.
4. They manually inspect detected clones for false positives, adding filters to remove these false positives.
o The sequence 2, 3, 4 is repeated until no false positives are found in a random sample of the detected clone groups.
o For each specification, a random sample of clone groups is analyzed, based on the kind of information they contain. Each clone is assigned to all suitable categories.
o On selected specifications, content analysis of the source code is performed. The code corresponding to specification clones is studied in order to classify whether the specification cloning resulted in code cloning, resulted in duplicated functionality without cloning, or was resolved through the creation of a shared abstraction.

Methodology: Study Objects
o The 28 RS are from various domains: administration, automotive, convenience, finance, telecommunication, and transportation.
o The RS were obtained from different organizations, including:
  o Munich Re Group: one of the largest reinsurance companies in the world, which employs more than 47,000 people in over 50 locations.
  o Siemens AG: the largest engineering company in Europe.
  o MOST Cooperation: a partnership of car manufacturers (including Audi, BMW and Daimler) and component suppliers.

Methodology: Study Implementation and Execution
o RQ1: The tool ConQAT is used to perform clone detection and to compute the clone measures. Detection is performed with a minimal clone length of 20 words.
o RQ2: If more than 20 clone groups are found for a specification, the manual classification is performed on a random sample of 20 clone groups. Otherwise, all clone groups for a specification are inspected. During inspection, 8 categories were added and 1 was changed.
o RQ3: Relative blow-up is computed as the ratio of the total number of words to the number of redundancy-free words. Absolute blow-up is computed as the difference between the total and the redundancy-free number of words.
o An average reading speed of 220 words per minute was used to calculate the additional effort for reading, while for the inspection task the rate used was 600 words per hour.
o RQ4: Precision is determined by measuring the percentage of relevant clones in the inspected sample. Clone detection tailoring is performed by creating regular expressions that match the false positives. A maximum of 20 randomly chosen clone groups is inspected in each tailoring step.

The Clone Detection Tool
o Input and Pre-Processing: The input phase reads the documents and produces a normalized word stream (using the Porter stemmer algorithm). It requires all the input data to be plain text. After reading the text contents of a specification, certain sections of the documents are excluded. The resulting text is split into single words; whitespace and punctuation are discarded.
o Detection: This phase extracts all substrings in the word stream that are sufficiently long and occur at least twice. The algorithm works by constructing a suffix tree from the token (word) stream. Each branch of the tree which reaches at least two leaves corresponds to a clone.
o Post-processing and Output: During post-processing, all clone groups which contain overlapping clones are removed. The output phase calculates several metrics on the clones.
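The idea behind the detection phase can be sketched in a few lines. This is not ConQAT's implementation: instead of a suffix tree, the sketch indexes fixed-size word windows, which is simpler but only reports repeated sequences of exactly `min_length` words. It also skips stemming and section exclusion. The function names and the word-level coverage ratio are assumptions chosen for illustration.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())

def find_clone_windows(words, min_length):
    """Return every window of min_length words that occurs at least twice,
    mapped to the list of positions where it starts."""
    positions = defaultdict(list)
    for i in range(len(words) - min_length + 1):
        positions[tuple(words[i:i + min_length])].append(i)
    return {seq: starts for seq, starts in positions.items() if len(starts) > 1}

def clone_coverage(words, clones, min_length):
    """Fraction of words that lie inside at least one repeated window
    (a word-level approximation of clone coverage)."""
    covered = set()
    for starts in clones.values():
        for start in starts:
            covered.update(range(start, start + min_length))
    return len(covered) / len(words) if words else 0.0

if __name__ == "__main__":
    text = ("The user logs in and the system shows the menu. "
            "Later the user logs in and the system shows a report.")
    words = tokenize(text)
    clones = find_clone_windows(words, min_length=5)
    print(len(clones), clone_coverage(words, clones, min_length=5))
```

The study used a minimal clone length of 20 words; the value 5 here only keeps the example short. A suffix tree finds maximal repeated sequences directly, whereas this window index would need an extra merging step to do so.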

Results: RQ1 Amount of Cloning
o Clone group cardinality: the number of times a specification fragment has been cloned.

Results: RQ2 Cloned Information
1. Detailed Use Case Steps
2. Reference
3. UI
4. Domain Knowledge
5. Interface Description
6. Pre-Condition
7. Side-Condition
8. Configuration
9. Feature
10. Technical Domain Knowledge
11. Post-Condition
12. Rationale

Results: RQ3 Consequences of Cloning

Specification Reading
o The average blow-up of the analyzed SRS is 3,578 words which, at a typical reading speed of 220 words per minute, translates to an additional 16 minutes.
o This amount increases to 6 hours for the inspection task.

Specification Modification
o The comments documented during the inspection of the sampled clones were analyzed (for each specification set). They refer to duplicated specification fragments that are considerably longer than the clones detected by the tool.

Specification Implementation
For the inspected 20 specification clone groups and their source code, 3 different effects were found:
1. The redundancy in the requirements is not reflected in the code. It contains shared abstractions that avoid duplication.
2. The code that implements a cloned piece of an SRS is also cloned. In this case, future changes to the cloned code cause additional effort, as modifications must be reflected in all clones. Furthermore, changes to cloned code are error-prone, as inconsistencies may be introduced accidentally.
3. Code of the same functionality has been implemented multiple times. This kind of redundancy is harder to detect, as existing clone detection approaches cannot find code that is functionally similar but not the result of copy and paste.
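The effort figures for specification reading follow directly from the blow-up and the two rates given in the study. A minimal sketch of that arithmetic, with function names that are assumptions for illustration:

```python
def blow_up(total_words, redundancy_free_words):
    """Return (relative, absolute) blow-up as defined for RQ3."""
    relative = total_words / redundancy_free_words
    absolute = total_words - redundancy_free_words
    return relative, absolute

def reading_minutes(extra_words, words_per_minute=220):
    """Additional reading time caused by extra_words of redundant text."""
    return extra_words / words_per_minute

def inspection_hours(extra_words, words_per_hour=600):
    """Additional inspection time caused by extra_words of redundant text."""
    return extra_words / words_per_hour

if __name__ == "__main__":
    average_blow_up = 3578  # words, as reported for the analyzed SRS
    print(round(reading_minutes(average_blow_up)))   # about 16 minutes
    print(round(inspection_hours(average_blow_up)))  # about 6 hours
```

The word counts passed to `blow_up` would come from the clone detector's output; the study reports only the resulting average of 3,578 words.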

Results: RQ4 Detection Tailoring and Accuracy
The false positives contain information from the following categories:
o Document meta data comprises information about the creation process of the document.
o Indexes do not add new information and are typically generated automatically by text processors.
o Page decorations are typically inserted automatically by text processors.
o Open issues document gaps in the specification, i.e., TODO statements.
o Specification template information contains section names common to all individual documents that are part of a specification.
o By using clone detection tailoring, precision values were above 85%, with an average of 99%.
o The time required for tailoring ranges up to 33 minutes across specifications, with an average value of 10 minutes.

Conclusions and Future Work
o The amount of cloning encountered is significant. However, as shown by the broad spectrum of findings, cloning in SRS can be successfully avoided.
o Cloning is not confined to a specific kind of information.
o The most obvious effect of duplication is the increased size, which could be avoided by cross references or a different organization of the specifications. Another consequence is the increase in the time spent reading the RS.
o Redundancy may lead to inconsistent changes of the clones, which may induce errors in the RS and thus in the final system.
o Specification cloning can lead to cloned or re-implemented parts of code.
o Existing clone detection approaches can be applied to identify cloned information in SRS. However, a certain amount of analysis tailoring is required to increase detection precision.
o Without any previous knowledge, one must assume that the probability that an arbitrary sentence in the specification is duplicated is greater than 10%.
o One should make SRS authors and reviewers aware of the problems that SRS cloning causes and avoid redundancy from the beginning.
o Sometimes duplication is employed intentionally in order to make a part of an SRS self-contained. In this case, you have to make sure that the duplicated part is maintained only once and that readers can recognize the duplication.

Threats to Validity: Internal Validity
o Subjectivity during the categorization of the cloned information → researchers in pairs and inter-rater agreement.
o Results influenced by mistakes or individual preferences of researchers during the tailoring phase → clone tailoring in pairs.
o Precision was determined on random samples.
o Inaccurate calculation of additional effort due to blow-up.
o Cloned and non-cloned text treated uniformly with respect to reading effort.
o Little interest in detection recall.
o No research on false negatives, i.e., the amount of duplication contained in a specification and not identified by the automated detector.

Threats to Validity: External Validity
o The practice of requirements engineering differs strongly between different domains, companies, and even projects, hindering the generalization of the results.

Questions and Discussion
o How could we detect document fragments that convey similar information but are different on the word level?
o What qualitative effects could be used during the content analysis of the source code?
o Can the results be generalized to any domain, any company, and any kind of software system?

Thanks!

Management. Software Quality. Dr. Stefan Wagner Technische Universität München. Garching 28 May 2010

Management. Software Quality. Dr. Stefan Wagner Technische Universität München. Garching 28 May 2010 Technische Universität München Software Quality Management Dr. Stefan Wagner Technische Universität München Garching 28 May 2010 Some of these slides were adapted from the tutorial "Clone Detection in

More information

Software product quality control Dr. Stefan Wagner Dr. Florian Deißenböck Technische Universität München

Software product quality control Dr. Stefan Wagner Dr. Florian Deißenböck Technische Universität München Tool-supported Software product quality control Dr. Stefan Wagner Dr. Florian Deißenböck Technische Universität München Google Developer Day Munich November 9, 2010 Continuous Quality Control Quality Model

More information

Flexible Architecture Conformance Assessment with ConQAT

Flexible Architecture Conformance Assessment with ConQAT Flexible Architecture Conformance Assessment with ConQAT Florian Deissenboeck, Lars Heinemann, Benjamin Hummel, Elmar Juergens Technische Universität München ICSE 2010 Cape Town Software Architecture Software

More information

A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2

A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2 A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2 1 Department of Electronics & Comp. Sc, RTMNU, Nagpur, India 2 Department of Computer Science, Hislop College, Nagpur,

More information

Information Extraction Techniques in Terrorism Surveillance

Information Extraction Techniques in Terrorism Surveillance Information Extraction Techniques in Terrorism Surveillance Roman Tekhov Abstract. The article gives a brief overview of what information extraction is and how it might be used for the purposes of counter-terrorism

More information

Information Retrieval and Organisation

Information Retrieval and Organisation Information Retrieval and Organisation Dell Zhang Birkbeck, University of London 2016/17 IR Chapter 02 The Term Vocabulary and Postings Lists Constructing Inverted Indexes The major steps in constructing

More information

Influence of Word Normalization on Text Classification

Influence of Word Normalization on Text Classification Influence of Word Normalization on Text Classification Michal Toman a, Roman Tesar a and Karel Jezek a a University of West Bohemia, Faculty of Applied Sciences, Plzen, Czech Republic In this paper we

More information

Chapter 4. Processing Text

Chapter 4. Processing Text Chapter 4 Processing Text Processing Text Modifying/Converting documents to index terms Convert the many forms of words into more consistent index terms that represent the content of a document What are

More information

Automatic Lemmatizer Construction with Focus on OOV Words Lemmatization

Automatic Lemmatizer Construction with Focus on OOV Words Lemmatization Automatic Lemmatizer Construction with Focus on OOV Words Lemmatization Jakub Kanis, Luděk Müller University of West Bohemia, Department of Cybernetics, Univerzitní 8, 306 14 Plzeň, Czech Republic {jkanis,muller}@kky.zcu.cz

More information

IJRIM Volume 2, Issue 2 (February 2012) (ISSN )

IJRIM Volume 2, Issue 2 (February 2012) (ISSN ) AN ENHANCED APPROACH TO OPTIMIZE WEB SEARCH BASED ON PROVENANCE USING FUZZY EQUIVALENCE RELATION BY LEMMATIZATION Divya* Tanvi Gupta* ABSTRACT In this paper, the focus is on one of the pre-processing technique

More information

LAB 3: Text processing + Apache OpenNLP

LAB 3: Text processing + Apache OpenNLP LAB 3: Text processing + Apache OpenNLP 1. Motivation: The text that was derived (e.g., crawling + using Apache Tika) must be processed before being used in an information retrieval system. Text processing

More information

A Novel Ontology Metric Approach for Code Clone Detection Using FusionTechnique

A Novel Ontology Metric Approach for Code Clone Detection Using FusionTechnique A Novel Ontology Metric Approach for Code Clone Detection Using FusionTechnique 1 Syed MohdFazalulHaque, 2 Dr. V Srikanth, 3 Dr. E. Sreenivasa Reddy 1 Maulana Azad National Urdu University, 2 Professor,

More information

Identifying and Ranking Possible Semantic and Common Usage Categories of Search Engine Queries

Identifying and Ranking Possible Semantic and Common Usage Categories of Search Engine Queries Identifying and Ranking Possible Semantic and Common Usage Categories of Search Engine Queries Reza Taghizadeh Hemayati 1, Weiyi Meng 1, Clement Yu 2 1 Department of Computer Science, Binghamton university,

More information

Evaluation of similarity metrics for programming code plagiarism detection method

Evaluation of similarity metrics for programming code plagiarism detection method Evaluation of similarity metrics for programming code plagiarism detection method Vedran Juričić Department of Information Sciences Faculty of humanities and social sciences University of Zagreb I. Lučića

More information

The goal of this project is to enhance the identification of code duplication which can result in high cost reductions for a minimal price.

The goal of this project is to enhance the identification of code duplication which can result in high cost reductions for a minimal price. Code Duplication New Proposal Dolores Zage, Wayne Zage Ball State University June 1, 2017 July 31, 2018 Long Term Goals The goal of this project is to enhance the identification of code duplication which

More information

Artop (AUTOSAR Tool Platform) Whitepaper

Artop (AUTOSAR Tool Platform) Whitepaper Artop (AUTOSAR Tool Platform) Whitepaper Updated version: March 2009 Michael Rudorfer 1, Stefan Voget 2, Stephan Eberle 3 1 BMW Car IT GmbH, Petuelring 116, 80809 Munich, Germany 2 Continental, Siemensstraße

More information

The Goal of this Document. Where to Start?

The Goal of this Document. Where to Start? A QUICK INTRODUCTION TO THE SEMILAR APPLICATION Mihai Lintean, Rajendra Banjade, and Vasile Rus vrus@memphis.edu linteam@gmail.com rbanjade@memphis.edu The Goal of this Document This document introduce

More information

Comment Extraction from Blog Posts and Its Applications to Opinion Mining

Comment Extraction from Blog Posts and Its Applications to Opinion Mining Comment Extraction from Blog Posts and Its Applications to Opinion Mining Huan-An Kao, Hsin-Hsi Chen Department of Computer Science and Information Engineering National Taiwan University, Taipei, Taiwan

More information

Empirical Analysis of Single and Multi Document Summarization using Clustering Algorithms

Empirical Analysis of Single and Multi Document Summarization using Clustering Algorithms Engineering, Technology & Applied Science Research Vol. 8, No. 1, 2018, 2562-2567 2562 Empirical Analysis of Single and Multi Document Summarization using Clustering Algorithms Mrunal S. Bewoor Department

More information

Introducing XAIRA. Lou Burnard Tony Dodd. An XML aware tool for corpus indexing and searching. Research Technology Services, OUCS

Introducing XAIRA. Lou Burnard Tony Dodd. An XML aware tool for corpus indexing and searching. Research Technology Services, OUCS Introducing XAIRA An XML aware tool for corpus indexing and searching Lou Burnard Tony Dodd Research Technology Services, OUCS What is XAIRA? XML Aware Indexing and Retrieval Architecture Developed from

More information

Watermark-Based Authentication and Key Exchange in Teleconferencing Systems

Watermark-Based Authentication and Key Exchange in Teleconferencing Systems Watermark-Based Authentication and Key Exchange in Teleconferencing Systems Ulrich Rührmair a, Stefan Katzenbeisser b, Martin Steinebach c, and Sascha Zmudzinski c a Technische Universität München, Department

More information

Design Patterns. An introduction

Design Patterns. An introduction Design Patterns An introduction Introduction Designing object-oriented software is hard, and designing reusable object-oriented software is even harder. Your design should be specific to the problem at

More information

Text Mining. Representation of Text Documents

Text Mining. Representation of Text Documents Data Mining is typically concerned with the detection of patterns in numeric data, but very often important (e.g., critical to business) information is stored in the form of text. Unlike numeric data,

More information

2 The IBM Data Governance Unified Process

2 The IBM Data Governance Unified Process 2 The IBM Data Governance Unified Process The benefits of a commitment to a comprehensive enterprise Data Governance initiative are many and varied, and so are the challenges to achieving strong Data Governance.

More information

Detection and Handling of Model Smells for MATLAB/Simulink Models

Detection and Handling of Model Smells for MATLAB/Simulink Models Detection and Handling of Model Smells for MATLAB/Simulink Models Thomas Gerlitz 1, Quang Minh Tran 2, and Christian Dziobek 3 1 Informatik 11 - Embedded Software, RWTH Aachen, Germany gerlitz@embedded.rwth-aachen.de

More information

International Journal for Management Science And Technology (IJMST)

International Journal for Management Science And Technology (IJMST) Volume 4; Issue 03 Manuscript- 1 ISSN: 2320-8848 (Online) ISSN: 2321-0362 (Print) International Journal for Management Science And Technology (IJMST) GENERATION OF SOURCE CODE SUMMARY BY AUTOMATIC IDENTIFICATION

More information

The KNIME Text Processing Plugin

The KNIME Text Processing Plugin The KNIME Text Processing Plugin Kilian Thiel Nycomed Chair for Bioinformatics and Information Mining, University of Konstanz, 78457 Konstanz, Deutschland, Kilian.Thiel@uni-konstanz.de Abstract. This document

More information

Data Models: The Center of the Business Information Systems Universe

Data Models: The Center of the Business Information Systems Universe Data s: The Center of the Business Information Systems Universe Whitemarsh Information Systems Corporation 2008 Althea Lane Bowie, Maryland 20716 Tele: 301-249-1142 Email: Whitemarsh@wiscorp.com Web: www.wiscorp.com

More information

Syntactic Analysis. CS345H: Programming Languages. Lecture 3: Lexical Analysis. Outline. Lexical Analysis. What is a Token? Tokens

Syntactic Analysis. CS345H: Programming Languages. Lecture 3: Lexical Analysis. Outline. Lexical Analysis. What is a Token? Tokens Syntactic Analysis CS45H: Programming Languages Lecture : Lexical Analysis Thomas Dillig Main Question: How to give structure to strings Analogy: Understanding an English sentence First, we separate a

More information

Analysing Text in Software Projects

Analysing Text in Software Projects Analysing Text in Software Projects Stefan Wagner a,, Daniel Méndez Fernández b a Software Engineering Group, Institute of Software Technology, University of Stuttgart, Universitätsstr. 38, 70569 Stuttgart,

More information

Precise Medication Extraction using Agile Text Mining

Precise Medication Extraction using Agile Text Mining Precise Medication Extraction using Agile Text Mining Chaitanya Shivade *, James Cormack, David Milward * The Ohio State University, Columbus, Ohio, USA Linguamatics Ltd, Cambridge, UK shivade@cse.ohio-state.edu,

More information

Model Clone Detection in Practice

Model Clone Detection in Practice Model Clone Detection in Practice Florian Deissenboeck, Benjamin Hummel Elmar Juergens, Michael Pfaehler Technische Universität München Garching b. München, Germany Bernhard Schaetz fortiss ggmbh München,

More information

David Hellenbrand and Udo Lindemann Technische Universität München, Institute of Product Development, Germany

David Hellenbrand and Udo Lindemann Technische Universität München, Institute of Product Development, Germany 10 TH INTERNATIONAL DESIGN STRUCTURE MATRIX CONFERENCE, DSM 08 11 12 NOVEMBER 2008, STOCKHOLM, SWEDEN USING THE DSM TO SUPPORT THE SELECTION OF PRODUCT CONCEPTS David Hellenbrand and Udo Lindemann Technische

More information

TIC: A Topic-based Intelligent Crawler

TIC: A Topic-based Intelligent Crawler 2011 International Conference on Information and Intelligent Computing IPCSIT vol.18 (2011) (2011) IACSIT Press, Singapore TIC: A Topic-based Intelligent Crawler Hossein Shahsavand Baghdadi and Bali Ranaivo-Malançon

More information

Department of Electronic Engineering FINAL YEAR PROJECT REPORT

Department of Electronic Engineering FINAL YEAR PROJECT REPORT Department of Electronic Engineering FINAL YEAR PROJECT REPORT BEngCE-2007/08-HCS-HCS-03-BECE Natural Language Understanding for Query in Web Search 1 Student Name: Sit Wing Sum Student ID: Supervisor:

More information

An Approach to Detect Clones in Class Diagram Based on Suffix Array

An Approach to Detect Clones in Class Diagram Based on Suffix Array An Approach to Detect Clones in Class Diagram Based on Suffix Array Amandeep Kaur, Computer Science and Engg. Department, BBSBEC Fatehgarh Sahib, Punjab, India. Manpreet Kaur, Computer Science and Engg.

More information

CLEF-IP 2009: Exploring Standard IR Techniques on Patent Retrieval

CLEF-IP 2009: Exploring Standard IR Techniques on Patent Retrieval DCU @ CLEF-IP 2009: Exploring Standard IR Techniques on Patent Retrieval Walid Magdy, Johannes Leveling, Gareth J.F. Jones Centre for Next Generation Localization School of Computing Dublin City University,

More information

DLFinder: Characterizing and Detecting Duplicate Logging Code Smells

DLFinder: Characterizing and Detecting Duplicate Logging Code Smells DLFinder: Characterizing and Detecting Duplicate Logging Code Smells Zhenhao Li, Tse-Hsun (Peter) Chen, Jinqiu Yang and Weiyi Shang Software PErformance, Analysis, and Reliability (SPEAR) Lab Department

More information

Design First ITS Instructor Tool

Design First ITS Instructor Tool Design First ITS Instructor Tool The Instructor Tool allows instructors to enter problems into Design First ITS through a process that creates a solution for a textual problem description and allows for

More information

A modified and fast Perceptron learning rule and its use for Tag Recommendations in Social Bookmarking Systems

A modified and fast Perceptron learning rule and its use for Tag Recommendations in Social Bookmarking Systems A modified and fast Perceptron learning rule and its use for Tag Recommendations in Social Bookmarking Systems Anestis Gkanogiannis and Theodore Kalamboukis Department of Informatics Athens University

More information

Assessing the Quality of Natural Language Text

Assessing the Quality of Natural Language Text Assessing the Quality of Natural Language Text DC Research Ulm (RIC/AM) daniel.sonntag@dfki.de GI 2004 Agenda Introduction and Background to Text Quality Text Quality Dimensions Intrinsic Text Quality,

More information

MACHINE LEARNING FOR SOFTWARE MAINTAINABILITY

MACHINE LEARNING FOR SOFTWARE MAINTAINABILITY MACHINE LEARNING FOR SOFTWARE MAINTAINABILITY Anna Corazza, Sergio Di Martino, Valerio Maggio Alessandro Moschitti, Andrea Passerini, Giuseppe Scanniello, Fabrizio Silverstri JIMSE 2012 August 28, 2012

More information

13 AutoFocus 3 - A Scientific Tool Prototype for Model-Based Development of Component-Based, Reactive, Distributed Systems

13 AutoFocus 3 - A Scientific Tool Prototype for Model-Based Development of Component-Based, Reactive, Distributed Systems 13 AutoFocus 3 - A Scientific Tool Prototype for Model-Based Development of Component-Based, Reactive, Distributed Systems Florian Hölzl and Martin Feilkas Institut für Informatik Technische Universität

More information

MODELLING DOCUMENT CATEGORIES BY EVOLUTIONARY LEARNING OF TEXT CENTROIDS

MODELLING DOCUMENT CATEGORIES BY EVOLUTIONARY LEARNING OF TEXT CENTROIDS MODELLING DOCUMENT CATEGORIES BY EVOLUTIONARY LEARNING OF TEXT CENTROIDS J.I. Serrano M.D. Del Castillo Instituto de Automática Industrial CSIC. Ctra. Campo Real km.0 200. La Poveda. Arganda del Rey. 28500

More information

CMSC 447: Software Engineering I
