An Information Retrieval Approach for Source Code Plagiarism Detection

Similar documents
CLEF-IP 2009: Exploring Standard IR Techniques on Patent Retrieval

Applying the KISS Principle for the CLEF-IP 2010 Prior Art Candidate Patent Search Task

DCU at FIRE 2013: Cross-Language Indian News Story Search

University of Virginia Department of Computer Science. CS 4501: Information Retrieval Fall 2015

An Axiomatic Approach to IR UIUC TREC 2005 Robust Track Experiments

Automatic Generation of Query Sessions using Text Segmentation

A Text-Based Approach to the ImageCLEF 2010 Photo Annotation Task

SOURCERER: MINING AND SEARCHING INTERNET-SCALE SOFTWARE REPOSITORIES

Document Expansion for Text-based Image Retrieval at CLEF 2009

Information Retrieval. (M&S Ch 15)

Recovering Traceability Links between Code and Documentation

Chapter 6: Information Retrieval and Web Search. An introduction

ResPubliQA 2010

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

DCU and 2010: Ad-hoc and Data-Centric tracks

Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web

SourcererCC -- Scaling Code Clone Detection to Big-Code

Entity and Knowledge Base-oriented Information Retrieval

New Issues in Near-duplicate Detection

I_n = number of words appearing exactly n times, N = number of words in the collection of words, A = a constant. For example, if N=100 and the most

External Query Reformulation for Text-based Image Retrieval

United we fall, Divided we stand: A Study of Query Segmentation and PRF for Patent Prior Art Search

Chapter 6. Queries and Interfaces

Relevance Feedback and Query Reformulation. Lecture 10 CS 510 Information Retrieval on the Internet Thanks to Susan Price. Outline

Dublin City University at CLEF 2005: Multi-8 Two-Years-On Merging Experiments

CSCI 599: Applications of Natural Language Processing Information Retrieval Retrieval Models (Part 3)

Collective Entity Resolution in Relational Data

Automated Tagging for Online Q&A Forums

ITS 2.0-ENABLED STATISTICAL MACHINE TRANSLATION AND TRAINING WEB SERVICES

Natural Language Processing

James Mayfield, The Johns Hopkins University Applied Physics Laboratory, The Human Language Technology Center of Excellence

Overview of SOCO Track on the Detection of SOurce COde Re-use

A RECOMMENDER SYSTEM FOR SOCIAL BOOK SEARCH

Modelling Structures in Data Mining Techniques

ΕΠΛ660. Retrieval with the Vector Space Model

Where Should the Bugs Be Fixed?

NUSIS at TREC 2011 Microblog Track: Refining Query Results with Hashtags

WEB SPAM IDENTIFICATION THROUGH LANGUAGE MODEL ANALYSIS

From Neural Re-Ranking to Neural Ranking:

WITHIN-DOCUMENT TERM-BASED INDEX PRUNING TECHNIQUES WITH STATISTICAL HYPOTHESIS TESTING. Sree Lekha Thota

Java Archives Search Engine Using Byte Code as Information Source

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

Semantic Estimation for Texts in Software Engineering

A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2

10/10/13. Traditional database system. Information Retrieval. Information Retrieval. Information retrieval system? Information Retrieval Issues

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016)

CHAPTER 5 Querying of the Information Retrieval System

EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING

Information Retrieval

Query Refinement and Search Result Presentation

CMPSCI 646, Information Retrieval (Fall 2003)

Instructor: Stefan Savev

MACHINE LEARNING FOR SOFTWARE MAINTAINABILITY

Introduction to Information Retrieval. Lecture Outline

CS294-1 Final Project. Algorithms Comparison

Query-Free News Search

TREC-10 Web Track Experiments at MSRA

Book Recommendation based on Social Information

Improving Document Ranking for Long Queries with Nested Query Segmentation

Similarity Overlap Metric and Greedy String Tiling at PAN 2012: Plagiarism Detection

Supplementary text S6 Comparison studies on simulated data

Inverted List Caching for Topical Index Shards

TRECVid 2013 Experiments at Dublin City University

Search Engines. Information Retrieval in Practice

End-to-End Neural Ad-hoc Ranking with Kernel Pooling

University of Amsterdam at INEX 2010: Ad hoc and Book Tracks

Information Retrieval Spring Web retrieval

Bi-directional Linkability From Wikipedia to Documents and Back Again: UMass at TREC 2012 Knowledge Base Acceleration Track

Document Retrieval using Predication Similarity

Outline. Possible solutions. The basic problem. How? How? Relevance Feedback, Query Expansion, and Inputs to Ranking Beyond Similarity

IJRIM Volume 2, Issue 2 (February 2012) (ISSN )

Boolean Model. Hongning Wang

Predictive Indexing for Fast Search

NUS-I2R: Learning a Combined System for Entity Linking

A cocktail approach to the VideoCLEF 09 linking task

Query Expansion using Wikipedia and DBpedia

Word Indexing Versus Conceptual Indexing in Medical Image Retrieval

Effective Structured Query Formulation for Session Search

Rishiraj Saha Roy and Niloy Ganguly IIT Kharagpur India. Monojit Choudhury and Srivatsan Laxman Microsoft Research India India

PERSONALIZED TAG RECOMMENDATION

Evaluation of similarity metrics for programming code plagiarism detection method

Ranked Retrieval. Evaluation in IR. One option is to average the precision scores at discrete points on the ROC curve. But which points?

Pseudo-Relevance Feedback and Title Re-Ranking for Chinese Information Retrieval

International Journal of Scientific & Engineering Research Volume 2, Issue 12, December ISSN Web Search Engine

Finding Similar Items:Nearest Neighbor Search

Query Processing and Alternative Search Structures. Indexing common words

THE WEB SEARCH ENGINE

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from

NTT SMT System for IWSLT Katsuhito Sudoh, Taro Watanabe, Jun Suzuki, Hajime Tsukada, and Hideki Isozaki NTT Communication Science Labs.

CS145: INTRODUCTION TO DATA MINING

Desktop Crawls. Document Feeds. Document Feeds. Information Retrieval

CIRGDISCO at RepLab2012 Filtering Task: A Two-Pass Approach for Company Name Disambiguation in Tweets

Exploiting Spatial Code Proximity and Order for Improved Source Code Retrieval for Bug Localization

Eagle Eye. Sommersemester 2017 Big Data Science Praktikum. Zhenyu Chen - Wentao Hua - Guoliang Xue - Bernhard Fabry - Daly

Diversifying Query Suggestions Based on Query Documents

Hybrid Approach for Query Expansion using Query Log

Information Retrieval

Vector Space Scoring Introduction to Information Retrieval INF 141/ CS 121 Donald J. Patterson

A Class of Submodular Functions for Document Summarization

Transcription:

DCU@FIRE-2014: An Information Retrieval Approach for Source Code Plagiarism Detection. Debasis Ganguly, Gareth J. F. Jones. CNGL: Centre for Global Intelligent Content, School of Computing, Dublin City University, Dublin 9, Ireland. December 6, 2014

Introduction. Question answering forums and programming-related blogs have made source code widely available to be read, copied and modified. This encourages a tendency to copy-paste code, which can result in IP violations. Research challenge: automatically detect cases of software plagiarism. The SOCO Task@FIRE 2014 provides a dataset for evaluating plagiarism detection performance.

Overview of our Approach. We undertake an information retrieval (IR) approach to solve this problem. Advantages: there is no need for exhaustive pairwise comparison between documents, so it is faster than pairwise exhaustive approaches, and it provides an ad-hoc query mechanism whereby a ranked list of candidate documents can be retrieved for a suspected document.

Research Challenges. Q1: Does a bag-of-words model (as used in standard IR) suffice, or should the source code structure be used in some way to extract more meaningful pieces of information? Q2: How should the source code documents be indexed so that a retrieval model can best make use of the indexed terms to retrieve relevant (plagiarized) documents at top ranks? Q3: How should a source code document be represented as a pseudo-query, i.e. what are the most likely representative terms in a source code document?

Limitations of Existing Methods. Near-duplicate document detection by shingling: typically only a part of the source code is copied for reuse, and occurrences of exact duplicates at the level of whole documents are rare, so the Jaccard coefficient of the shingles is expected to be low. Bag-of-words model: produces false high similarity values between non-plagiarized document pairs, because of the use of similar programming-language-specific constructs and keywords. Example: programs tend to use a frequent set of variable names, especially for looping constructs, e.g. i, j, k etc.
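To make the shingling limitation concrete, here is a minimal sketch (not from the paper; the shingle size k and the toy code snippets are purely illustrative) showing that when only a small fragment is copied into a larger program, the whole-document shingle overlap stays well below 1:

    def shingles(tokens, k=4):
        # Set of k-gram shingles over a token stream.
        return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

    def jaccard(a, b):
        # Jaccard coefficient of two shingle sets.
        return len(a & b) / len(a | b) if a | b else 0.0

    # A copied loop embedded in an otherwise original program:
    fragment = "int i = 0 ; while ( i < n ) { s += a [ i ] ; i ++ ; }".split()
    original = ("public class Foo { static int run ( int [ ] a , int n ) { int s = 0 ; "
                "int i = 0 ; while ( i < n ) { s += a [ i ] ; i ++ ; } "
                "return s ; } }").split()
    print(jaccard(shingles(fragment), shingles(original)))  # well below 1.0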

Limitations of Existing Methods (Contd.). Exhaustive pairwise similarity: the most popular approach, and one that works well in practice, but it does not scale well to large collections. It is often tested on a small collection, e.g. (Chae et al., 2013) uses a collection of only 56 programs for evaluation.

Source Code Parsing. Different parts of source code need to be matched differently. Example: a program which defines a class HelloWorld vs. a program which defines a string literal HelloWorld. We parse the source code to build an annotated syntax tree (AST), then extract terms from specific nodes of the AST.

Source Code Parsing (Contd.)

Table: Annotated syntax tree nodes of a Java program from which terms are extracted during indexing.

Field name             Field description
Classes                Names of the classes used in a Java source
Method calls           Method names and actual parameter names and types
String literals        Values of the string constants
Arrays                 Names of arrays and dimensions
Method definitions     Names of methods and formal parameter names and types
Assignment statements  Variable names and types
Package imports        Names of imported packages
Comments               Contents of the source code comments
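As a concrete illustration of this field-wise extraction, here is a minimal sketch using the javalang Python parser (an assumption made for illustration; the paper does not specify which parser it uses, and the field names below are hypothetical). It routes terms into per-field lists so that the class name HelloWorld and the string literal HelloWorld land in different fields:

    import javalang

    def extract_field_terms(src):
        # Extract indexing terms from selected AST node types, one list per field.
        tree = javalang.parse.parse(src)
        fields = {"classes": [], "method_calls": [], "string_literals": [],
                  "method_defs": [], "imports": [], "assignments": []}
        for _, node in tree.filter(javalang.tree.ClassDeclaration):
            fields["classes"].append(node.name)
        for _, node in tree.filter(javalang.tree.MethodInvocation):
            fields["method_calls"].append(node.member)
        for _, node in tree.filter(javalang.tree.Literal):
            # javalang keeps the quotes on string literal values.
            if isinstance(node.value, str) and node.value.startswith('"'):
                fields["string_literals"].append(node.value.strip('"'))
        for _, node in tree.filter(javalang.tree.MethodDeclaration):
            fields["method_defs"].append(node.name)
            fields["method_defs"].extend(p.name for p in node.parameters)
        for _, node in tree.filter(javalang.tree.Import):
            fields["imports"].append(node.path)
        for _, node in tree.filter(javalang.tree.VariableDeclarator):
            fields["assignments"].append(node.name)
        return fields

    src = 'class HelloWorld { void greet() { System.out.println("HelloWorld"); } }'
    print(extract_field_terms(src))
    # "HelloWorld" appears under both "classes" and "string_literals",
    # so the two uses can be matched differently at retrieval time.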

Pseudo-Queries from Source Codes. It is not reasonable to use a whole source code document as a pseudo-query, since typically only a part of the source code is copy-pasted. Instead, we extract a pre-set number of terms from each field of a document using a term scoring function. We use the LM term scoring function

LM(t, f, d) = \lambda \frac{tf(t, f, d)}{len(f, d)} + (1 - \lambda) \frac{cf(t)}{cs}    (1)

where tf(t, f, d) is the frequency of term t in field f of document d, len(f, d) is the length of that field, cf(t) is the collection frequency of t, cs is the collection size, and \lambda is a smoothing parameter.
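A minimal sketch of this selection step (the λ value and the top-k cut-off of 50 terms per field are illustrative, not the paper's tuned settings; coll_tf is assumed to be a Counter of term frequencies over the whole collection and coll_size the total number of term occurrences):

    from collections import Counter

    def pseudo_query(fields, coll_tf, coll_size, k=50, lam=0.6):
        # Keep the top-k terms per field, scored by the smoothed LM weight of Eq. (1).
        query = []
        for field, terms in fields.items():
            if not terms:
                continue
            tf = Counter(terms)
            score = {t: lam * tf[t] / len(terms)
                        + (1 - lam) * coll_tf[t] / coll_size
                     for t in tf}
            query.extend((field, t) for t in
                         sorted(score, key=score.get, reverse=True)[:k])
        return query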

Plagiarized Set from Ranked List. We threshold on the relative drops in the similarity values of the ranked list:

Plag(Q) = \{ D_i : \frac{sim(Q, D_{i-1}) - sim(Q, D_i)}{sim(Q, D_{i-1})} \le \epsilon \}    (2)
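A minimal sketch of this cut-off (the ε value is illustrative; ranked is assumed to be a list of (doc_id, similarity) pairs sorted by decreasing similarity):

    def plagiarized_set(ranked, eps=0.1):
        # Truncate the ranked list at the first relative similarity drop
        # exceeding eps, per Eq. (2).
        if not ranked:
            return []
        kept = [ranked[0][0]]
        for (_, prev_sim), (doc_id, sim) in zip(ranked, ranked[1:]):
            if (prev_sim - sim) / prev_sim > eps:
                break  # a sharp drop: later documents are unlikely plagiarism sources
            kept.append(doc_id)
        return kept

    ranked = [("d7", 0.91), ("d3", 0.88), ("d9", 0.41), ("d2", 0.40)]
    print(plagiarized_set(ranked))  # ['d7', 'd3']: the 0.88 -> 0.41 drop ends the set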

Baselines. Run-1: standard LM retrieval with a flat bag-of-words representation; source codes are treated as unstructured text documents, and the index does not consist of separate fields. Run-2: terms are extracted from the selected nodes of the AST, but no separate fields are used to store them. Other approaches explored (not submitted to the SOCO task): using different numbers of terms while constructing the pseudo-queries, and unigram and bigram (word-level) indexing.

Training Set Results

          Parameters                      Metrics
AST   Fields   #Qry terms   n-gram   Precision   Recall   F-score
no    no       all          1        0.8000      0.0952   0.1702
no    no       50           1        0.6363      0.4166   0.5035
yes   no       all          1        0.7778      0.0833   0.1505
yes   no       50           1        0.7631      0.3452   0.4754
yes   yes      all          1        0.8235      0.1666   0.2772
yes   yes      50           1        0.7894      0.3571   0.4918
no    no       all          2        0.7667      0.2738   0.4035
no    no       50           2        0.7200      0.4285   0.5373
yes   no       all          2        0.7391      0.2023   0.3177
yes   no       50           2        0.5714      0.2857   0.3809
yes   yes      all          2        0.7826      0.2142   0.3364
yes   yes      50           2        0.6842      0.6190   0.6500

Test Set (Official SOCO) Results

                  Parameters                     Metrics
Run        Parse   Fields   #terms   n-gram   Prec.   Rcll    F-sc
dcu-run1   no      no       50       2        0.432   0.995   0.602
dcu-run2   yes     no       50       2        0.530   0.995   0.692
dcu-run3   yes     yes      50       2        0.515   1.000   0.680

The test set results are somewhat different from the training set ones. A flat index constituted from the AST terms produces very good results (which was not the case for the training set). Flat indexing with no parsing yields worse precision (and hence F-score) in comparison to the parsing-based approaches. Surprisingly, field-based LM does not turn out to be more effective than the standard bag-of-words LM.

Conclusions. Q1: Program structure is important for determining source code plagiarism (both the development set and the official results empirically confirm this). Q2: Does a field-based retrieval model improve results further? Inconclusive from the results on the development and test sets. Q3: An LM-based method for selecting representative terms works significantly better than using all terms for pseudo-query formulation.