Natural Language Processing Is No Free Lunch
- Aubrey Lang
- 6 years ago
STEFAN WAGNER, UNIVERSITY OF STUTTGART, STUTTGART, GERMANY

Introduction
o Impressive progress in NLP: operating systems ship with personal assistants like Siri or Cortana.
o Brief check on how, and how not, to apply NLP in software analytics.
o Study case: NLP applied to the documentation of software systems.
o Most of the documentation, although structured and versatile (JavaDoc and Doxygen), focuses on the level of functions/methods and classes, while the component level is often missing. Why not use the former to generate the latter?

NL Data in Software Projects
o There is a lot of NL in a software project:
  o Textual documentation for the user or of the software architecture.
  o Commit messages and issue descriptions.
  o Comments in the source code.

How to apply NLP: 2 techniques
o Nowadays there is a wide range of algorithms that allow processing large NL datasets.
o Part-of-speech tagging: returns the grammatical use of each word (verb, noun, or determiner).
o Topic modeling: extracts the most probable topics in an NL dataset.
o Stemming: the process of removing morphological and inflectional endings from words. "Read" and "reading" → they both correspond to the same word.
o Lemmatization: employs a dictionary and morphological analysis to return the base form of a word (called the lemma). "Better" has the lemma "good" (stemming would miss it).
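The difference between stemming and lemmatization described above can be illustrated with a small sketch. It is a toy: the suffix list and the mini-dictionary are hypothetical, and the stemmer is a crude suffix stripper, not the Porter algorithm or any real morphological analyzer.

```python
# Toy illustration only: a crude suffix stripper and a tiny lookup-based
# lemmatizer. Neither is the Porter stemmer nor a real morphological analyzer.

SUFFIXES = ("ing", "ed", "s")  # hypothetical, deliberately small suffix list

# Hypothetical mini-dictionary mapping irregular forms to their lemma.
LEMMA_DICT = {"better": "good", "best": "good", "was": "be", "mice": "mouse"}


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word: str) -> str:
    """Look the word up in the dictionary first, then fall back to stemming."""
    word = word.lower()
    return LEMMA_DICT.get(word, toy_stem(word))


print(toy_stem("reading"), toy_stem("read"))            # same stem for both forms
print(toy_stem("better"), toy_lemmatize("better"))      # only the lemmatizer maps to "good"
```

Suffix stripping alone cannot relate "better" to "good"; that link needs dictionary knowledge, which is why lemmatization is the more expensive of the two steps.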
NLP is no magic
o You have to try alternative algorithms and tune them during the analysis.
o The results will depend strongly on the quality of the analyzed NL data:
  o The analysis of an official Java library, which is well documented, will work well, while
  o the analysis of open source code (which has fewer comments) will provide less useful results.
o Don't discard manual analysis:
  o Humans can make connections based on their own knowledge and experience, and they can formulate results that are easily accessible to other humans.
  o You can use manual feedback combined with NLP techniques.
  o Systematic manual analysis: it requires a simple coding of the textual data, where a tag or code (a word, a sentence, or a whole paragraph) is attached to the piece of NL being analyzed.

Can Clone Detection Support Quality Assessments of Requirements Specifications?
Elmar Juergens, Florian Deissenboeck, Martin Feilkas, Benjamin Hummel, Bernhard Schaetz, Stefan Wagner
Technische Universität München, Garching b. München, Germany
Christoph Domann, Jonathan Streit
itestra GmbH, Garching b. München, Germany

Introduction
o Software requirements specifications (SRS) are the keystone of most software projects.
  o They influence the product's quality and the effort spent on development.
  o They are the key (and often only) communication artifact between customer and contractor.
o SRS are mostly written in NL → few techniques for automated quality assessment.
o It is possible to use clone detection to tackle redundancy.
o Clone detection is commonly applied to find duplications in source code (cloning), which can:
  o Increase a project's size and the effort required for size-related activities.
  o Lead to errors caused by inconsistent changes.

Research Problem
o 4 objectives:
  o Do real-world requirements specifications contain duplicated information?
  o What kind of information is duplicated?
  o Which consequences does the duplication of information have on the different software development activities?
  o Can existing clone detection approaches be applied in practice to identify duplication in SRS automatically?
o 28 SRS analyzed → a total of 8,667 pages.
Terminology
o Requirements specification (RS): a specification for a particular software product, program, or set of programs that performs certain functions in a specific environment.
o An RS can be interpreted as a single sequence of words.
o A normalized RS is obtained when its set of words is transformed by grouping sets of similar words.
o A specification clone is a substring of the normalized specification with a certain minimal length, appearing at least twice.
o A clone group contains all clones of a specification that have the same content.
o Clones of a relevant clone group must convey similar information, and this information must refer to the system described (e.g., system interaction steps).
o Clone coverage denotes the part of a specification that is covered by cloning. It represents the probability that an arbitrarily chosen specification sentence is cloned at least once.
o Number of clone groups and clones denotes how many different logical specification fragments have been copied and how often they occur.
o Blow-up describes how much larger the specification is compared to a hypothetical specification version that contains no cloning.

Methodology: Study Definition
o 4 research questions:
  o RQ1. How much cloning do real-world requirements specifications contain?
  o RQ2. What kind of information is cloned in requirements specifications?
  o RQ3. What consequences does cloning in requirements specifications have?
  o RQ4. Can cloning in requirements specifications be detected accurately using existing clone detectors?
o Content analysis of the study objects (specifications from industrial projects), performed with clone detection and manually.

Methodology: Study Design
1. First, RS are assigned randomly to pairs of researchers for further analysis.
2. Clone detection is performed on all documents of a specification.
3. The researcher pairs perform clone detection tailoring for each specification.
4. They manually inspect detected clones for false positives, adding filters to remove these false positives.
o The sequence 2, 3, 4 is repeated until no false positives are found in a random sample of the detected clone groups.

Methodology: Study Design
o For each specification, a random sample of clone groups is analyzed based on the kind of information they contain. The clone is assigned to all suitable categories.
o On selected specifications, content analysis of the source code is performed. The code corresponding to specification clones is studied in order to classify whether the specification cloning resulted in code cloning, duplicated functionality without cloning, or was resolved through the creation of a shared abstraction.
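The clone measures from the Terminology slide can be sketched as a small function. This is an illustration under stated assumptions, not the tool's implementation: `clone_measures` and its input layout are hypothetical, coverage is counted per word rather than per sentence, clones are assumed not to overlap, and "redundancy-free" is taken to mean that one occurrence per clone group is kept. Blow-up follows the study's definitions: the relative value is the ratio of total words to redundancy-free words, the absolute value is their difference.

```python
def clone_measures(total_words, clone_groups):
    """Compute clone measures from detected clone groups.

    clone_groups is a list of groups; each group lists the (start, length)
    word spans of its clones. Spans are assumed not to overlap, and
    total_words is assumed to be larger than the redundant word count.
    """
    covered = set()
    redundant_words = 0
    for group in clone_groups:
        for start, length in group:
            covered.update(range(start, start + length))
        # Assumption: one occurrence per group is kept, the rest is redundant.
        redundant_words += sum(length for _, length in group[1:])
    redundancy_free = total_words - redundant_words
    return {
        "clone_groups": len(clone_groups),
        "clones": sum(len(group) for group in clone_groups),
        "coverage": len(covered) / total_words,
        "relative_blow_up": total_words / redundancy_free,
        "absolute_blow_up": total_words - redundancy_free,
    }


# Hypothetical 100-word specification with two clone groups.
measures = clone_measures(100, [[(0, 10), (50, 10)], [(20, 5), (70, 5), (90, 5)]])
print(measures)
```

In this made-up example, 35 of 100 words lie inside a clone, and removing the 20 redundant words would leave an 80-word specification.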
Methodology: Study Objects
o The 28 RS are from various domains: administration, automotive, convenience, finance, telecommunication, and transportation.
o The RS were obtained from different organizations, including:
  o Munich Re Group: one of the largest reinsurance companies in the world, employing more than 47,000 people in over 50 locations.
  o Siemens AG: the largest engineering company in Europe.
  o MOST Cooperation: a partnership of car manufacturers (including Audi, BMW and Daimler) and component suppliers.

Methodology: Study Implementation and Execution
o RQ1: The tool ConQAT is used to perform clone detection and to compute the clone measures. Detection is performed with a minimal clone length of 20 words.
o RQ2: If more than 20 clone groups are found for a specification, the manual classification is performed on a random sample of 20 clone groups. Otherwise, all clone groups of the specification are inspected. During inspection, 8 categories were added and 1 was changed.
o RQ3: Relative blow-up is computed as the ratio of the total number of words to the number of redundancy-free words. Absolute blow-up is computed as the difference between the total and the redundancy-free number of words.

Methodology: Study Implementation and Execution
o An average reading speed of 220 words per minute was used to calculate the additional effort for reading, while a rate of 600 words per hour was used for the inspection task.
o RQ4: Precision is determined by measuring the percentage of relevant clones in the inspected sample. Clone detection tailoring is performed by creating regular expressions that match the false positives. A maximum of 20 randomly chosen clone groups is inspected in each tailoring step.

The Clone Detection Tool
Input and Pre-Processing
The input phase reads the documents and produces a normalized word stream (using the Porter stemmer algorithm). It requires all input data to be plain text. After reading the text contents of a specification, certain sections of the documents are excluded. The resulting text is split into single words; whitespace and punctuation are discarded.
Detection
This phase extracts all substrings in the word stream that are sufficiently long and occur at least twice. The algorithm works by constructing a suffix tree from the token (word) stream. Each branch of the tree which reaches at least two leaves corresponds to a clone.
Post-processing and Output
During post-processing, all clone groups which contain overlapping clones are removed. The output phase calculates several metrics on the clones.
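The pipeline above (exclude parts of the text, build a normalized word stream, find repeated word sequences) can be sketched in a few lines. This is a simplified stand-in, not ConQAT: it indexes fixed-size word windows in a dictionary instead of building a suffix tree, normalizes by lowercasing only (no Porter stemming), and does not merge adjacent windows into maximal clones or remove overlapping ones. The function names and the sample exclusion pattern are hypothetical.

```python
import re
from collections import defaultdict


def word_stream(text, exclude_patterns=()):
    """Drop lines matching any exclusion regex, then split into lowercase words."""
    compiled = [re.compile(pattern) for pattern in exclude_patterns]
    words = []
    for line in text.splitlines():
        if any(pattern.search(line) for pattern in compiled):
            continue  # tailoring filter: skip known false-positive lines
        # Whitespace and punctuation are discarded by keeping alphanumeric runs.
        words.extend(word.lower() for word in re.findall(r"[A-Za-z0-9]+", line))
    return words


def find_repeated_windows(words, min_length):
    """Map each window of min_length words occurring at least twice to its start positions."""
    positions = defaultdict(list)
    for start in range(len(words) - min_length + 1):
        window = tuple(words[start:start + min_length])
        positions[window].append(start)
    return {window: starts for window, starts in positions.items() if len(starts) >= 2}


sample = (
    "Page 1\n"
    "The user enters the password and confirms.\n"
    "Some other text here.\n"
    "The user enters the password and confirms.\n"
)
stream = word_stream(sample, exclude_patterns=[r"^Page \d+$"])
for window, starts in find_repeated_windows(stream, min_length=5).items():
    print(" ".join(window), starts)
```

The study used a minimal clone length of 20 words; the window of 5 here only keeps the sample short. A suffix tree finds clones of any length in one pass, which is the main reason to prefer it over this window index on real specifications.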
Results: RQ1 Amount of Cloning
o Clone group cardinality: the number of times a specification fragment has been cloned.

Results: RQ2 Cloned Information
1. Detailed Use Case Steps
2. Reference
3. UI
4. Domain Knowledge
5. Interface Description
6. Pre-Condition
7. Side-Condition
8. Configuration
9. Feature
10. Technical Domain Knowledge
11. Post-Condition
12. Rationale

Results: RQ3 Consequences of Cloning
Specification Reading
o The average blow-up of the analyzed SRS is 3,578 words which, at a typical reading speed of 220 words per minute, translates to an additional 16 minutes of reading.
o This amount increases to 6 hours for the inspection task.
Specification Modification
o The comments documented during the inspection of the sampled clones were analyzed (for each specification set). They refer to duplicated specification fragments that are considerably longer than the clones detected by the tool.

Results: RQ3 Consequences of Cloning
Specification Implementation
For the 20 inspected specification clone groups and their source code, 3 different effects were found:
1. The redundancy in the requirements is not reflected in the code. The code contains shared abstractions that avoid duplication.
2. The code that implements a cloned piece of an SRS is also cloned. In this case, future changes to the cloned code cause additional effort, as each modification must be reflected in all clones. Furthermore, changes to cloned code are error-prone, as inconsistencies may be introduced accidentally.
3. Code of the same functionality has been implemented multiple times. This kind of redundancy is harder to detect, as existing clone detection approaches cannot find code that is functionally similar but not the result of copy&paste.
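The reading and inspection figures above follow directly from the rates stated in the study (220 words per minute for reading, 600 words per hour for inspection). A short check of that arithmetic, with hypothetical function names:

```python
READING_WORDS_PER_MINUTE = 220   # reading speed stated in the study
INSPECTION_WORDS_PER_HOUR = 600  # inspection rate stated in the study


def extra_reading_minutes(blow_up_words):
    """Additional reading time, in minutes, caused by the redundant words."""
    return blow_up_words / READING_WORDS_PER_MINUTE


def extra_inspection_hours(blow_up_words):
    """Additional inspection time, in hours, caused by the redundant words."""
    return blow_up_words / INSPECTION_WORDS_PER_HOUR


average_blow_up = 3578  # average absolute blow-up in words, from the results
print(round(extra_reading_minutes(average_blow_up), 1))   # about 16 minutes
print(round(extra_inspection_hours(average_blow_up), 1))  # about 6 hours
```

Both values scale linearly with the blow-up, so the estimate treats cloned and non-cloned text as equally costly to read.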
Results: RQ4 Detection Tailoring and Accuracy
The false positives contain information from the following categories:
o Document meta data comprises information about the creation process of the document.
o Indexes do not add new information and are typically generated automatically by text processors.
o Page decorations are typically inserted automatically by text processors.
o Open issues document gaps in the specification, i.e., TODO statements.
o Specification template information contains section names common to all individual documents that are part of a specification.
o By using clone detection tailoring, precision values were above 85%, with an average of 99%.
o The time required for tailoring ranges up to 33 minutes across specifications, with an average value of 10 minutes.

Conclusions and Future Work
o The amount of cloning encountered is significant. However, as shown by the broad spectrum of findings, cloning in SRS can be successfully avoided.
o Cloning is not confined to a specific kind of information.
o The most obvious effect of duplication is the increased size, which could be avoided by cross references or a different organization of the specifications. Another consequence is the increase in the time spent reading the RS.
o Redundancy may lead to inconsistent changes of the clones, which may induce errors in the RS and thus in the final system.
o Specification cloning can lead to cloned or re-implemented parts of code.

Conclusions and Future Work
o Existing clone detection approaches can be applied to identify cloned information in SRS. However, a certain amount of analysis tailoring is required to increase detection precision.
o Without any previous knowledge, one must assume that the probability that an arbitrary sentence in the specification is duplicated is greater than 10%.
o One should make SRS authors and reviewers aware of the problems that SRS cloning causes and avoid redundancy from the beginning.
o Sometimes duplication is employed intentionally in order to make a part of an SRS self-contained. In this case, you have to make sure that the duplicated part is maintained only once and that readers can recognize the duplication.

Threats to Validity: Internal Validity
o Results influenced by mistakes or individual preferences of researchers during the tailoring phase → clone tailoring in pairs.
o Precision was determined on random samples.
o Inaccurate calculation of additional effort due to blow-up.
o Cloned and non-cloned text treated uniformly with respect to reading effort.
o Little interest in detection recall.
o No research on false negatives, i.e., the amount of duplication contained in a specification and not identified by the automated detector.
o Subjectivity during the categorization of the cloned information → researchers in pairs and inter-rater agreement.
Threats to Validity: External Validity
o The practice of requirements engineering differs strongly between different domains, companies, and even projects, which limits the generalization of the results.

Questions and Discussion
o How could we detect document fragments that convey similar information but are different on the word level?
o Which qualitative effects could be considered during the content analysis of the source code?
o Can the results be generalized to any domain, any company, and any kind of software system?

Thanks!
More informationCenter for Reflected Text Analytics. Lecture 2 Annotation tools & Segmentation
Center for Reflected Text Analytics Lecture 2 Annotation tools & Segmentation Summary of Part 1 Annotation theory Guidelines Inter-Annotator agreement Inter-subjective annotations Annotation exercise Discuss
More informationAn Oracle White Paper October Oracle Social Cloud Platform Text Analytics
An Oracle White Paper October 2012 Oracle Social Cloud Platform Text Analytics Executive Overview Oracle s social cloud text analytics platform is able to process unstructured text-based conversations
More informationCS2112 Fall Assignment 4 Parsing and Fault Injection. Due: March 18, 2014 Overview draft due: March 14, 2014
CS2112 Fall 2014 Assignment 4 Parsing and Fault Injection Due: March 18, 2014 Overview draft due: March 14, 2014 Compilers and bug-finding systems operate on source code to produce compiled code and lists
More informationJoint Entity Resolution
Joint Entity Resolution Steven Euijong Whang, Hector Garcia-Molina Computer Science Department, Stanford University 353 Serra Mall, Stanford, CA 94305, USA {swhang, hector}@cs.stanford.edu No Institute
More informationiserver Free Archimate ArchiMate 1.0 Template Stencil: Getting from Started Orbus Guide Software Thanks for Downloading the Free ArchiMate Template! Orbus Software have created a set of Visio ArchiMate
More informationDecision Management in the Insurance Industry: Standards and Tools
Decision Management in the Insurance Industry: Standards and Tools Kimon Batoulis 1, Alexey Nesterenko 2, Günther Repitsch 2, and Mathias Weske 1 1 Hasso Plattner Institute, University of Potsdam, Potsdam,
More informationRequirements. Chapter Learning objectives of this chapter. 2.2 Definition and syntax
Chapter 2 Requirements A requirement is a textual description of system behaviour. A requirement describes in plain text, usually English, what a system is expected to do. This is a basic technique much
More informationIII Data Structures. Dynamic sets
III Data Structures Elementary Data Structures Hash Tables Binary Search Trees Red-Black Trees Dynamic sets Sets are fundamental to computer science Algorithms may require several different types of operations
More informationClassifying Twitter Data in Multiple Classes Based On Sentiment Class Labels
Classifying Twitter Data in Multiple Classes Based On Sentiment Class Labels Richa Jain 1, Namrata Sharma 2 1M.Tech Scholar, Department of CSE, Sushila Devi Bansal College of Engineering, Indore (M.P.),
More informationefmea RAISING EFFICIENCY OF FMEA BY MATRIX-BASED FUNCTION AND FAILURE NETWORKS
efmea RAISING EFFICIENCY OF FMEA BY MATRIX-BASED FUNCTION AND FAILURE NETWORKS Maik Maurer Technische Universität München, Product Development, Boltzmannstr. 15, 85748 Garching, Germany. Email: maik.maurer@pe.mw.tum.de
More informationPattern Mining in Frequent Dynamic Subgraphs
Pattern Mining in Frequent Dynamic Subgraphs Karsten M. Borgwardt, Hans-Peter Kriegel, Peter Wackersreuther Institute of Computer Science Ludwig-Maximilians-Universität Munich, Germany kb kriegel wackersr@dbs.ifi.lmu.de
More informationBetter Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web
Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web Thaer Samar 1, Alejandro Bellogín 2, and Arjen P. de Vries 1 1 Centrum Wiskunde & Informatica, {samar,arjen}@cwi.nl
More informationInformation Retrieval
Information Retrieval CSC 375, Fall 2016 An information retrieval system will tend not to be used whenever it is more painful and troublesome for a customer to have information than for him not to have
More informationLexical Analysis. Lecture 3. January 10, 2018
Lexical Analysis Lecture 3 January 10, 2018 Announcements PA1c due tonight at 11:50pm! Don t forget about PA1, the Cool implementation! Use Monday s lecture, the video guides and Cool examples if you re
More informationCustomisable Curation Workflows in Argo
Customisable Curation Workflows in Argo Rafal Rak*, Riza Batista-Navarro, Andrew Rowley, Jacob Carter and Sophia Ananiadou National Centre for Text Mining, University of Manchester, UK *Corresponding author:
More informationApplication documentation Documentation
Application documentation Documentation Release 0.1 Daniele Procida June 14, 2016 Contents 1 Tutorial 3 1.1 Setting up................................................. 3 1.2 Configuring the documentation.....................................
More informationTowards Systematic Usability Verification
Towards Systematic Usability Verification Max Möllers RWTH Aachen University 52056 Aachen, Germany max@cs.rwth-aachen.de Jonathan Diehl RWTH Aachen University 52056 Aachen, Germany diehl@cs.rwth-aachen.de
More informationKnowledge Extraction from German Automotive Software Requirements using NLP-Techniques and a Grammar-based Pattern Detection
Knowledge Extraction from German Automotive Software s using NLP-Techniques and a Grammar-based Pattern Detection Mathias Schraps Software Development Audi Electronics Venture GmbH 85080 Gaimersheim, Germany
More informationName: Lirong TAN 1. (15 pts) (a) Define what is a shortest s-t path in a weighted, connected graph G.
1. (15 pts) (a) Define what is a shortest s-t path in a weighted, connected graph G. A shortest s-t path is a path from vertex to vertex, whose sum of edge weights is minimized. (b) Give the pseudocode
More informationCPSC 695. Data Quality Issues M. L. Gavrilova
CPSC 695 Data Quality Issues M. L. Gavrilova 1 Decisions Decisions 2 Topics Data quality issues Factors affecting data quality Types of GIS errors Methods to deal with errors Estimating degree of errors
More informationTurn Indicator Model Overview
Turn Indicator Model Overview Jan Peleska 1, Florian Lapschies 1, Helge Löding 2, Peer Smuda 3, Hermann Schmid 3, Elena Vorobev 1, and Cornelia Zahlten 2 1 Department of Mathematics and Computer Science
More informationOn the automatic classification of app reviews
Requirements Eng (2016) 21:311 331 DOI 10.1007/s00766-016-0251-9 RE 2015 On the automatic classification of app reviews Walid Maalej 1 Zijad Kurtanović 1 Hadeer Nabil 2 Christoph Stanik 1 Walid: please
More informationExploiting Internal and External Semantics for the Clustering of Short Texts Using World Knowledge
Exploiting Internal and External Semantics for the Using World Knowledge, 1,2 Nan Sun, 1 Chao Zhang, 1 Tat-Seng Chua 1 1 School of Computing National University of Singapore 2 School of Computer Science
More information[Gidhane* et al., 5(7): July, 2016] ISSN: IC Value: 3.00 Impact Factor: 4.116
IJESRT INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY AN EFFICIENT APPROACH FOR TEXT MINING USING SIDE INFORMATION Kiran V. Gaidhane*, Prof. L. H. Patil, Prof. C. U. Chouhan DOI: 10.5281/zenodo.58632
More informationNatural Language to Database Interface
Natural Language to Database Interface Aarti Sawant 1, Pooja Lambate 2, A. S. Zore 1 Information Technology, University of Pune, Marathwada Mitra Mandal Institute Of Technology. Pune, Maharashtra, India
More informationIntroduction to Lexical Analysis
Introduction to Lexical Analysis Outline Informal sketch of lexical analysis Identifies tokens in input string Issues in lexical analysis Lookahead Ambiguities Specifying lexers Regular expressions Examples
More informationApache UIMA ConceptMapper Annotator Documentation
Apache UIMA ConceptMapper Annotator Documentation Written and maintained by the Apache UIMA Development Community Version 2.3.1 Copyright 2006, 2011 The Apache Software Foundation License and Disclaimer.
More informationMetaNews: An Information Agent for Gathering News Articles On the Web
MetaNews: An Information Agent for Gathering News Articles On the Web Dae-Ki Kang 1 and Joongmin Choi 2 1 Department of Computer Science Iowa State University Ames, IA 50011, USA dkkang@cs.iastate.edu
More informationBetter Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web
Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web Thaer Samar 1, Alejandro Bellogín 2, and Arjen P. de Vries 1 1 Centrum Wiskunde & Informatica, {samar,arjen}@cwi.nl
More information