Detection of Text Plagiarism and Wikipedia Vandalism
1 Detection of Text Plagiarism and Wikipedia Vandalism Benno Stein Bauhaus-Universität Weimar Keynote at SEPLN, Valencia, 8. Sep. 2010
2 The webis Group: Benno Stein, Dennis Hoppe, Maik Anderka, Martin Potthast, Matthias Hagen, Nedim Lipka, Peter Prettenhofer, Tim Gollub, Christin Gläser
5 Outline: External Plagiarism Detection; The PAN Competition; Intrinsic Plagiarism Detection; Vandalism Detection in Wikipedia; The PAN Competition Continued
6 Plagiarism is the practice of claiming, or implying, original authorship of someone else's written or creative work, in whole or in part, into one's own without adequate acknowledgment. [Wikipedia: Plagiarism, 2009]
7 ... better technology nowadays ;)
9 Is plagiarism a problem with respect to education?
Is there a misunderstanding w.r.t. an evolving cultural technique? (a service that exploits the unacknowledged wisdom of the crowd.)
Can plagiarism be detected by humans?
Can plagiarism be detected by machines?
Should automatic plagiarism detection algorithms become standard?
For several reasons we should say text reuse rather than plagiarism.
12 External Plagiarism Detection
13 External Plagiarism Detection: How Humans Spot Plagiarism
+ Exploits human intuition for peculiar passages.
+ Exploits human experience to analyze the search engine results.
+ Is applied easily and in an ad-hoc manner.
- Cannot be done on a large scale.
- Depends on (commercial) third-party services.
- Fails in the case of obfuscated / modified text.
- Cannot be used to find the reuse of structure or argumentation lines.
17 External Plagiarism Detection, Algorithms for Machines. The pipeline: Step 1 Keyword Extraction from Document; Step 2 Heuristic Search in the WWW; Step 3 Document Selection; Step 4 Detailed Analysis; Step 5 Knowledge-based Post-processing.
19 Step 1, Keyword Extraction from Document. Where are the crucial keywords? Check for noun phrases. Find orthographic mistakes. Consider word frequency classes. But don't look in titles, captions, or headings.
21 External Plagiarism Detection, Algorithms for Machines: Heuristic Search in the WWW (Step 2). Query keywords: information retrieval, query formulation, search session, user support.
25 External Plagiarism Detection, Algorithms for Machines: the Query Cover Problem (Step 2 continued). Given: 1. a set W of keywords; 2. a query interface for a Web search engine S; 3. an upper bound k on the result list length. Task: find a family of queries Q covering W, yielding at most k Web results each. [User over Ranking]
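The query cover task above can be sketched as a greedy procedure. Note this is an illustration, not the algorithm from the talk: `search_result_count` is a hypothetical stand-in for a real search engine interface, and the growth heuristic (add keywords until the query is specific enough) is an assumption.

```python
# Greedy sketch of the query cover problem: cover the keyword set W
# with queries that each yield at most k Web results.
# `search_result_count(query)` is a hypothetical search interface.

def greedy_query_cover(W, search_result_count, k=100, max_query_len=4):
    """Return a list of query strings that together cover W (sketch)."""
    uncovered = set(W)
    queries = []
    while uncovered:
        query = []
        # Grow the query until it is specific enough (<= k results)
        # or the maximum query length is reached.
        for word in sorted(uncovered):
            candidate = query + [word]
            if len(candidate) > max_query_len:
                break
            query = candidate
            if search_result_count(" ".join(query)) <= k:
                break
        queries.append(" ".join(query))
        uncovered -= set(query)
    return queries
```

Each round removes at least one keyword from the uncovered set, so the procedure terminates even when no query reaches the result bound.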
28 External Plagiarism Detection, Algorithms for Machines: Detailed Analysis (Step 4). Technologies, from basic to complex, and what each can detect:
MD5 hashing: identity analysis for paragraphs.
Hashed breakpoint chunking: synchronized identity analysis for paragraphs.
Fuzzy-fingerprinting: tolerant similarity analysis for paragraphs.
Dot plotting: sequences of word n-grams.
29 External Plagiarism Detection, Algorithms for Machines: Pairwise Comparison. 1. Partition each document into meaningful sections, also called chunks. 2. Do a pairwise comparison using a similarity function ϕ. [Slide illustrates the scheme on an example paper, "On Web-based Plagiarism Analysis" (Kleppe, Braunsdorf, Loessnitz, Meyer zu Eissen; Bauhaus University Weimar).] Complexity: n documents in the corpus, c chunks per document on average, O(n c^2) comparisons.
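The pairwise comparison scheme can be sketched as follows. The fixed word-window chunking and cosine similarity as ϕ are illustrative choices, not the specific functions used in the talk:

```python
# Sketch of pairwise chunk comparison: split documents into chunks,
# represent each chunk as a bag of words, and compare every suspicious
# chunk against every corpus chunk with a similarity function phi.
from collections import Counter
import math

def chunks(text, size=50):
    words = text.lower().split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def phi(a, b):
    """Cosine similarity between two chunks (one possible phi)."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def find_matches(suspicious, corpus, threshold=0.8):
    # O(n * c^2) comparisons, as stated on the slide.
    matches = []
    for s in chunks(suspicious):
        for doc_id, doc in enumerate(corpus):
            for c in chunks(doc):
                if phi(s, c) >= threshold:
                    matches.append((doc_id, s, c))
    return matches
```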
30 External Plagiarism Detection, Algorithms for Machines: MD5 Hashing. 1. Partition each document into equidistant sections. 2. Compute hash values of the chunks using a hash function h. 3. Put all hashes into a hash table; a collision indicates matching chunks. [Slide shows the example document with chunk hashes such as h=9154 and h=2232.] Complexity: n documents in the corpus, c chunks per document on average, O(n c) operations (fingerprint generation, hash table operations).
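A minimal sketch of the MD5 fingerprinting step above; the chunk size and the word-level normalization are assumptions for illustration:

```python
# Sketch of MD5-based fingerprinting: hash equidistant chunks and use
# hash-table collisions to flag identical chunks in O(n * c) operations.
import hashlib
from collections import defaultdict

def md5_chunks(text, size=8):
    """Yield (md5 digest, chunk) pairs for equidistant word chunks."""
    words = text.lower().split()
    for i in range(0, len(words), size):
        chunk = " ".join(words[i:i + size])
        yield hashlib.md5(chunk.encode()).hexdigest(), chunk

def collisions(suspicious, corpus):
    table = defaultdict(list)
    for doc_id, doc in enumerate(corpus):
        for h, chunk in md5_chunks(doc):
            table[h].append((doc_id, chunk))
    # A collision indicates a chunk copied verbatim.
    return [(h, table[h]) for h, _ in md5_chunks(suspicious) if h in table]
```

The weakness noted on the next slides follows directly: changing a single word in a chunk changes its MD5 value, so nothing is detected.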
31 External Plagiarism Detection, Algorithms for Machines: Hashed Breakpoint Chunking. 1. Partition each document into synchronized sections. 2. Compute hash values of the chunks using a hash function h. 3. Put all hashes into a hash table; a collision indicates matching chunks. [Slide shows the example document with chunk hashes such as h=3294 and h=7439.] Complexity: n documents in the corpus, c chunks per document on average, O(n c) operations (fingerprint generation, hash table operations).
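One common way to realize such synchronized (content-defined) chunking is to place a chunk boundary after every word whose hash is 0 modulo k. This sketch illustrates the idea, not the exact scheme referenced on the slide:

```python
# Sketch of hashed breakpoint chunking: a boundary is placed after every
# word whose hash is 0 mod k, so chunk boundaries re-synchronize after
# insertions or deletions (unlike fixed equidistant chunks).
import hashlib

def breakpoint_chunks(text, k=4):
    chunks, current = [], []
    for word in text.lower().split():
        current.append(word)
        digest = int(hashlib.md5(word.encode()).hexdigest(), 16)
        if digest % k == 0:          # breakpoint word closes the chunk
            chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because boundaries depend only on the words themselves, an unchanged passage produces the same chunks (and hence the same fingerprints) regardless of what was inserted before it.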
32 External Plagiarism Detection, Algorithms for Machines: Fuzzy-Fingerprinting. Standard hashing: equal chunks yield the same hash key, h(c1) = h(c2) if c1, c2 are equal. Problem: sensitive to the smallest changes. Fuzzy-fingerprinting: different but similar chunks yield the same hash key, h_ϕ(c1) = h_ϕ(c2) if c1, c2 are similar, with high probability. Approach: abstraction by reducing the alphabet and neglecting word order. Problem: similarity-sensitive hash functions suffer from a low recall.
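A toy similarity-sensitive hash in the spirit described above, reducing the alphabet (here: word prefixes) and neglecting word order. The prefix reduction is an illustrative assumption, not the original fuzzy-fingerprinting construction:

```python
# Sketch of a similarity-sensitive hash: map a chunk to the sorted set
# of its word prefixes, so reordered or lightly edited chunks collapse
# onto the same key, at the cost of some recall and precision.

def fuzzy_key(chunk, prefix_len=4):
    words = chunk.lower().split()
    # Reduce the alphabet: keep only word prefixes, drop order and counts.
    reduced = sorted({w[:prefix_len] for w in words})
    return hash(tuple(reduced))

# Reordered chunks with changed word endings get equal keys:
a = "the algorithm detects plagiarized passages"
b = "plagiarized passages the algorithm detection"
assert fuzzy_key(a) == fuzzy_key(b)
```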
33 External Plagiarism Detection, Algorithms for Machines: Dot Plotting. Geometric sequence analysis of all word 4-grams of two interesting documents.
36 Dot plotting, level 1 (black): each dot indicates a common word 4-gram (hash collision).
37 Level 2 (blue): neighboring common 4-grams are heuristically grouped.
38 Level 3 (red): blue groups are merged by a cluster analysis (DBSCAN).
39 The involved text of a cluster forms a plagiarism candidate.
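The level-1 dot plot can be sketched directly from shared word 4-grams; the grouping and clustering of levels 2 and 3 are omitted here:

```python
# Sketch of dot-plot construction: a dot at (i, j) marks a word 4-gram
# that occurs at position i in document A and position j in document B.
# Diagonal runs of dots indicate contiguous reused passages.

def ngrams(text, n=4):
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def dot_plot(doc_a, doc_b, n=4):
    positions_b = {}
    for j, g in enumerate(ngrams(doc_b, n)):
        positions_b.setdefault(g, []).append(j)
    dots = []
    for i, g in enumerate(ngrams(doc_a, n)):
        for j in positions_b.get(g, []):
            dots.append((i, j))
    return dots
```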
41 External Plagiarism Detection, Algorithms for Machines: Knowledge-based Post-processing (Step 5). Check for problematic decisions: citation analysis (which can itself be problematic: consider an excuse citation in a footnote along with a completely reused text), and comparison of authors and co-authors.
42 How to overcome the language barrier: machine translation services; mapping into a concept space (ESA, CL-ESA).
43 The PAN Competition
44 The 2nd International Competition on Plagiarism Detection, PAN 2010. Facts: organized as a CLEF 2010 Lab; 18 groups from 12 countries participated; 15 weeks of training and testing (March to June); the training corpus was PAN-PC-09; the test corpus was PAN-PC-10, a new version of last year's corpus; incidentally, the 1st competition was held at SEPLN 09. Task: given a set of suspicious documents and a set of source documents, find all plagiarized sections in the suspicious documents and, if available, the corresponding source sections.
45 The PAN Competition: Plagiarism Corpus PAN-PC-10. A large-scale resource for the controlled evaluation of detection algorithms: documents obtained from books from Project Gutenberg, with about 0-10 plagiarism cases per document. PAN-PC-10 addresses a broad range of plagiarism situations by varying reasonably within the following parameters: 1. document length; 2. document language; 3. detection task; 4. plagiarism case length; 5. plagiarism case obfuscation; 6. plagiarism case topic alignment.
46 The PAN Competition: PAN-PC-10 Document Statistics. Document length: 50% short (1-10 pages), 35% medium, 15% long. Document language: 80% English, 10% German (de), 10% Spanish (es). Detection task: 70% external analysis, 30% intrinsic analysis. [Slide also plots the plagiarism fraction per document in percent, split into plagiarized and unmodified text.]
47 The PAN Competition: PAN-PC-10 Plagiarism Case Statistics. Plagiarism case length: 34% short, 33% medium, 33% long. Plagiarism case obfuscation: 40% none, 40% artificial [3], 6% simulated [4], 14% cross-language [5], ranging from low to high obfuscation. [3] Artificial plagiarism: algorithmic obfuscation. [4] Simulated plagiarism: obfuscation via Amazon Mechanical Turk (AMT). [5] Cross-language plagiarism: obfuscation due to machine translation de-en and es-en. Plagiarism case topic alignment: 50% intra-topic, 50% inter-topic.
48 The PAN Competition: Plagiarism Detection Results, ranked by plagdet: Kasprzak, Zou, Muhr, Grozea, Oberreuter, Torrejón, Pereira, Palkovskii, Sobha, Gottron, Micol, Costa-jussà, Nawab, Gupta, Vania, Suàrez, Alzahrani, Iftene. Plagdet combines precision, recall, and granularity. Precision and recall are well-known, yet not well-defined; granularity measures the number of times a single plagiarism case has been detected. [Potthast et al., 2010]
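The plagdet combination can be sketched as follows, assuming the definition from Potthast et al. (2010): the harmonic mean of precision and recall, discounted by the logarithm of the granularity:

```python
# Sketch of the plagdet score: F1 of precision and recall, divided by
# log2(1 + granularity). Granularity >= 1 counts how often a single
# plagiarism case is detected in pieces, so fragmented detections are
# penalized while granularity 1 leaves F1 unchanged.
import math

def plagdet(precision, recall, granularity):
    if precision + recall == 0:
        return 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return f1 / math.log2(1 + granularity)
```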
49 The PAN Competition: Plagiarism Detection Results, broken down into recall, precision, and granularity for the same participants: Kasprzak, Zou, Muhr, Grozea, Oberreuter, Torrejón, Pereira, Palkovskii, Sobha, Gottron, Micol, Costa-jussà, Nawab, Gupta, Vania, Suàrez, Alzahrani, Iftene.
50 Intrinsic Plagiarism Detection
51 How Humans Spot Plagiarism, when no corpus along with a powerful search engine is at hand:
look for style changes
check for peculiarities (orthographic mistakes, typographical habits)
listen to the instincts (perhaps the most powerful technology)
53 Intrinsic Plagiarism Detection, Algorithms for Machines. The pipeline: Step 1 Impurity Assessment; Step 2 Chunking Strategy; Step 3 Style Model Construction; Step 4 Outlier Identification; Step 5 Outlier Post-processing.
55 Step 1, Impurity Assessment. How large is the fraction θ of plagiarized text? Document length analysis; genre analysis (e.g. scientific article versus editorial); analysis of the issuing institution.
57 Step 2, Chunking Strategy. How to find text positions where plagiarism starts or ends? From basic to complex: uniform length chunking (simple but naive); structural boundaries (chapters, paragraphs, tables, captions); topical boundaries (difficult but powerful); stylistic boundaries (best, but usually intractable).
59 Intrinsic Plagiarism Detection, Algorithms for Machines: Style Model Construction (Step 3). The question of stylometry: how to quantify writing style? From basic to complex: structural features (paragraph lengths, use of tables, signatures); character-based lexical features (n-gram frequency, compression rate); word-based lexical features (readability, writing complexity); syntactic features (part of speech, function words); dialectic power and argumentation.
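A minimal style model over two of the word-based lexical features named above (average word length and average sentence length); the feature choice and tokenization are illustrative assumptions:

```python
# Sketch of a chunk-level style vector: map a text chunk to simple
# word-based lexical features. In a full system many such features
# (readability scores, frequency classes, POS n-grams) are combined.
import re

def style_vector(chunk):
    sentences = [s for s in re.split(r"[.!?]+", chunk) if s.strip()]
    words = re.findall(r"[A-Za-z]+", chunk)
    if not words or not sentences:
        return (0.0, 0.0)
    avg_word_len = sum(len(w) for w in words) / len(words)
    avg_sentence_len = len(words) / len(sentences)
    return (avg_word_len, avg_sentence_len)
```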
60 Intrinsic Plagiarism Detection, Algorithms for Machines: Style Features that Work. Stylometric features, ranked by F-measure: Flesch Reading Ease Score; average number of syllables per word; frequency of the term "of"; Noun-Verb-Noun tri-gram; Noun-Noun-Verb tri-gram; Verb-Noun-Noun tri-gram; Gunning Fog index; Yule's K measure; Flesch-Kincaid grade level; average word length; Noun-Preposition-ProperNoun tri-gram; Honore's R measure; average word frequency class; Consonant-Vowel-Consonant tri-gram; frequency of the term "is"; Noun-Noun-CoordinatingConjunction tri-gram; NounPlural-Preposition-Determiner tri-gram; Determiner-NounPlural-Preposition tri-gram; Consonant-Vowel-Vowel tri-gram.
61 Intrinsic Plagiarism Detection Algorithms for Machines Impurity Assessment Chunking Strategy Style Model Construction Outlier Identification Outlier Post-processing Step 1 Step 2 Step 3 Step 4 Step 5 74 [ ] c
62 Intrinsic Plagiarism Detection Algorithms for Machines
Impurity Assessment (Step 1), Chunking Strategy (Step 2), Style Model Construction (Step 3), Outlier Identification (Step 4), Outlier Post-processing (Step 5)
One-class classification: we have an idea about positive examples only. :(
- Density methods try to model a style feature's distribution.
- Boundary methods try to cluster text portions of similar style.
- Reconstruction methods quantify the style generation error under the average style model.
75 [ ] c
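A density method in its simplest form models one style feature (say, average word length per chunk) by its mean and standard deviation over the whole document and flags chunks far from that bulk. The z-score sketch below illustrates the idea; the threshold is an assumption, not a value from the talk:

```python
import statistics

def density_outliers(feature_values, z_threshold=2.0):
    """One-class density sketch: estimate the feature's distribution
    from ALL chunks (only 'positive', average-style examples exist)
    and flag chunks deviating by more than z_threshold standard
    deviations from the mean."""
    mean = statistics.mean(feature_values)
    sd = statistics.pstdev(feature_values)
    if sd == 0:
        return []
    return [i for i, v in enumerate(feature_values)
            if abs(v - mean) / sd > z_threshold]

# average word length per chunk; chunk 9 deviates strongly
outliers = density_outliers(
    [4.1, 4.3, 4.0, 4.2, 4.1, 4.4, 4.0, 4.2, 4.3, 7.5])
```

Note the one-class caveat: the outliers themselves contaminate the estimated mean and deviation, which is why robust estimates or the Bayesian treatment on the next slides are preferable.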
63 Intrinsic Plagiarism Detection Algorithms for Machines: Outlier Identification

[Screenshot of a paper page, with the style vectors of chunks overlaid:]
Alexander Kleppe, Dennis Braunsdorf, Christoph Loessnitz, Sven Meyer zu Eissen. Bauhaus University Weimar, Faculty of Media / Media Systems, Weimar, Germany.
Abstract: The paper in hand presents a Web-based application for the analysis of text documents with respect to plagiarism. Aside from reporting experiences with standard algorithms, a new method for plagiarism analysis is introduced. Since well-known algorithms for plagiarism detection assume the existence of a candidate document collection against which a suspicious document can be compared, they are unsuited to spot potentially copied passages using only the input document. This kind of plagiarism remains undetected, e.g., when paragraphs are copied from sources that are not available electronically. Our method is able to detect a change in writing style, and consequently to identify suspicious passages within a single document. Apart from contributing to solve the outlined problem, the presented method can also be used to focus a search for potentially original documents.
Key words: plagiarism analysis, style analysis, focused search, chunking, Kullback-Leibler divergence
1 Introduction: Plagiarism refers to the use of another's ideas, information, language, or writing, when done without proper acknowledgment of the original source [15]. Recently, the growing amount of digitally available documents contributes to the possibility to easily find and (partially) copy text documents given a specific topic: according to McCabe's plagiarism study on 18,000 students, about 50% of the students admit to plagiarize from Internet documents [7].
1.1 Plagiarism Forms: Plagiarism happens in several forms. Heintze distinguishes between the following textual relationships between documents: identical copy, edited copy, reorganized document, revisioned document, condensed/expanded document, documents that include portions of other documents. Moreover, unauthorized (partial) translations and documents that copy the structure of other documents can also be seen as plagiarized. Figure 1 depicts a taxonomy of plagiarism forms. Orthogonal to plagiarism forms are the underlying media: plagiarism may happen in articles, books, or computer programs.

style vectors of chunks 80 [ ] c
64 Intrinsic Plagiarism Detection Algorithms for Machines: Outlier Identification
[Same paper-page screenshot as on slide 63; style vectors of chunks plotted along a style-feature axis.]
Assume that style features of outliers (plagiarized text) are uniformly distributed. 81 [ ] c
65 Intrinsic Plagiarism Detection Algorithms for Machines: Outlier Identification
[Same paper-page screenshot as on slide 63; style vectors of chunks plotted along a style-feature axis.]
Assume that style features of original text are Gaussian distributed. 82 [ ] c
66 Intrinsic Plagiarism Detection Algorithms for Machines: Outlier Identification
[Same paper-page screenshot as on slide 63; the style-feature axis is partitioned into H1 | uncertainty interval | H0 | uncertainty interval | H1.]
Compute the maximum a-posteriori hypothesis under Naive Bayes. 83 [ ] c
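The decision on slides 64-66 can be sketched as a per-feature Naive Bayes MAP test: a Gaussian likelihood for original text, a uniform likelihood for plagiarized text, and the larger posterior wins. The prior and all parameter values below are illustrative assumptions:

```python
import math

def map_outlier(x, mean, sd, lo, hi, prior_plag=0.07):
    """Per-feature MAP decision: original text is modeled as
    Gaussian(mean, sd), plagiarized text as Uniform(lo, hi);
    return the hypothesis with the larger posterior. Under Naive
    Bayes, several features would multiply their likelihoods."""
    gauss = (math.exp(-0.5 * ((x - mean) / sd) ** 2)
             / (sd * math.sqrt(2 * math.pi)))
    p_original = (1 - prior_plag) * gauss
    p_plagiarized = prior_plag * (1.0 / (hi - lo) if lo <= x <= hi else 0.0)
    return "plagiarized" if p_plagiarized > p_original else "original"

# a chunk whose average word length matches the document's style model
a = map_outlier(4.2, mean=4.2, sd=0.2, lo=0.0, hi=10.0)   # "original"
# a chunk far out in the Gaussian tail falls to the uniform hypothesis
b = map_outlier(8.0, mean=4.2, sd=0.2, lo=0.0, hi=10.0)   # "plagiarized"
```

Values near the crossover of the two densities land in the uncertainty intervals of slide 66, where neither hypothesis clearly dominates.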
67 Intrinsic Plagiarism Detection Algorithms for Machines
Impurity Assessment (Step 1), Chunking Strategy (Step 2), Style Model Construction (Step 3), Outlier Identification (Step 4), Outlier Post-processing (Step 5)
84 [ ] c
68 Intrinsic Plagiarism Detection Algorithms for Machines
Impurity Assessment (Step 1), Chunking Strategy (Step 2), Style Model Construction (Step 3), Outlier Identification (Step 4), Outlier Post-processing (Step 5)
Since we are still unsure... How to obtain additional evidence about authorship?
- Raise precision at the expense of recall (analyze the ROC characteristic).
- If sufficient text is available, apply authorship verification technology (unmasking).
85 [ ] c
69 Vandalism Detection in Wikipedia 86 [ ] c
70 Vandalism Detection in Wikipedia Example: size, capitalization, punctuation, word existence 87 [ ] c
71 Vandalism Detection in Wikipedia Example: size, capitalization, punctuation, word existence 88 [ ] c
72 Vandalism Detection in Wikipedia Example: size, capitalization, punctuation, repetition 89 [ ] c
73 Vandalism Detection in Wikipedia Example: size, capitalization, punctuation, repetition 90 [ ] c
74 Vandalism Detection in Wikipedia Example: vulgarism, sentiment 91 [ ] c
75 Vandalism Detection in Wikipedia Example: vulgarism, sentiment 92 [ ] c
76 Vandalism Detection in Wikipedia Example: special chars, spacing 93 [ ] c
77 Vandalism Detection in Wikipedia Example: special chars, spacing 94 [ ] c
78 Vandalism Detection in Wikipedia Example: misguided helping 95 [ ] c
79 Vandalism Detection in Wikipedia Example: misguided helping 96 [ ] c
80 Vandalism Detection in Wikipedia Example: wrong facts, defamation 97 [ ] c
81 Vandalism Detection in Wikipedia Example: wrong facts, defamation 98 [ ] c
82 Vandalism Detection in Wikipedia Example: opinionated 99 [ ] c
83 Vandalism Detection in Wikipedia Example: opinionated 100 [ ] c
84 Vandalism Detection in Wikipedia Example: wrong facts, nonsense 101 [ ] c
85 Vandalism Detection in Wikipedia Example: wrong facts, nonsense 102 [ ] c
86 Vandalism Detection in Wikipedia Example: wrong facts, article history may suggest otherwise 103 [ ] c
87 Vandalism Detection in Wikipedia: The Machine Learning Perspective
The achievements of ML unfold their full power in discrimination situations. The tasks of intrinsic plagiarism analysis, authorship verification, and vandalism detection share a particular characteristic: they are one-class classification problems. Feature engineering plays an outstanding role. 105 [ ] c
88 Vandalism Detection in Wikipedia Two Types of Edit Features 106 [ ] c
89 Vandalism Detection in Wikipedia: Two Types of Edit Features: Content-based

Character-level features:
- Capitalization: ratio of upper-case chars to lower-case chars (all chars)
- Distribution: Kullback-Leibler divergence of the char distribution from the expectation
- Compressibility: compression rate of the edit differences
- Markup: ratio of new (changed) wikitext chars to all wikitext chars

Word-level features:
- Vulgarism: frequency of vulgar words
- Pronouns: frequency of personal pronouns
- Sentiment: frequency of sentiment words

Spelling and grammar features:
- Word existence: ratio of words that occur in an English dictionary
- Spelling: frequency (impact) of spelling errors
- Grammar: number of grammatical errors

Edit type features:
- Edit type: the edit is an insertion, deletion, modification, or a combination
- Replacement: the article (a paragraph) is completely replaced, excluding its title

107 [ ] c
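A few of these content-based features are computable in a couple of lines each on the text an edit inserts. The sketch below covers capitalization ratio, compressibility (repeated characters, common in vandalism, compress well), and vulgarism frequency; the vulgar-word list is a tame illustrative stand-in:

```python
import zlib

VULGAR_WORDS = {"idiot", "stupid", "dumb"}  # illustrative stand-in list

def content_features(inserted_text):
    """Compute three content-based edit features on inserted text:
    capitalization (upper/lower ratio), compressibility (zlib
    compression rate), and vulgarism (vulgar-word frequency)."""
    upper = sum(c.isupper() for c in inserted_text)
    lower = sum(c.islower() for c in inserted_text)
    data = inserted_text.encode("utf-8")
    rate = len(zlib.compress(data)) / len(data) if data else 1.0
    words = inserted_text.lower().split()
    vulgar = sum(w.strip(".,!?") in VULGAR_WORDS for w in words)
    return {
        "capitalization": upper / max(1, lower),
        "compressibility": rate,
        "vulgarism": vulgar / max(1, len(words)),
    }

f = content_features("WIKIPEDIA IS STUPID!!!!!!!!!!!!!!!!!!")
```

An all-caps, vulgar, highly repetitive edit scores high on all three features, which is exactly the signal a downstream classifier exploits.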
90 Vandalism Detection in Wikipedia: Two Types of Edit Features: Context-based

Edit comment features:
- Existence: a comment was given
- Length: length of the comment

Edit time features:
- Edit time: hour of the day the edit was made
- Successiveness: logarithm of the time difference to the previous edit

Article revision history features:
- Revisions: number of revisions
- Regular: number of regular edits

Article trustworthiness features:
- Suspect topic: the article is on the list of often-vandalized articles
- WikiTrust: values from the WikiTrust trust histogram

Editor reputation features:
- Anonymous: anonymous editor
- Reputation: scores that compute a user's reputation based on previous edits
- Registration: time the editor was registered with Wikipedia

108 [ ] c
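Context-based features come from edit metadata rather than from the edited text. The sketch below computes a few of them from a hypothetical edit record; the dict keys ('comment', 'timestamp', 'prev_timestamp', 'anonymous') are assumptions for illustration, not a real Wikipedia API schema:

```python
import math

def context_features(edit):
    """Compute comment, time, and editor features from edit metadata.
    `edit` is a dict with hypothetical keys: 'comment' (str or None),
    'timestamp' and 'prev_timestamp' (seconds), 'anonymous' (bool)."""
    comment = edit.get("comment") or ""
    # Successiveness: log of the time gap to the previous edit
    # (vandalism often arrives in rapid succession).
    dt = max(1.0, edit["timestamp"] - edit["prev_timestamp"])
    return {
        "comment_exists": bool(comment),
        "comment_length": len(comment),
        "log_successiveness": math.log(dt),
        "anonymous": edit.get("anonymous", True),
    }

f = context_features({"comment": "", "timestamp": 1000.0,
                      "prev_timestamp": 900.0, "anonymous": True})
```

An anonymous edit without a comment, made shortly after the previous one, is already suspicious before a single character of its content is inspected.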
91 The PAN Competition Continued 109 [ ] c
92 The PAN Competition Continued
1st International Competition on Wikipedia Vandalism Detection, PAN 2010
Facts:
- organized as a CLEF 2010 Lab
- 9 groups from 5 countries participated, 5 of the groups from the USA
- 15 weeks of training and testing (March to June)
- the corpus was newly created for the purpose of the competition
Task: Given a set of edits on Wikipedia articles, distinguish ill-intentioned edits from well-intentioned edits. 111 [ ] c
93 The PAN Competition Continued
Vandalism Corpus PAN-WVC-10: a large-scale resource for the controlled evaluation of detection algorithms:
- edits (sampled from a week's worth of Wikipedia edit logs)
- different edited articles (edit frequency resembles article importance)
- 2391 edits are vandalism (a 7% ratio is in concordance with the literature)
The edits in PAN-WVC-10 have been reviewed by 753 human annotators, recruited at Amazon's Mechanical Turk:
- Each edit was reviewed by at least 3 different annotators.
- If the annotators did not agree, the edit was reviewed again by 3 others.
- If still fewer than 2/3 of the annotators agreed, 3 more annotators were asked.
- After 8 iterations only 70 edits remained in a tie, which proved to be tough choices.
112 [ ] c
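The iterative review procedure above can be sketched as a loop that keeps requesting batches of three votes until one side holds a 2/3 majority. `get_votes` is a hypothetical stand-in for a Mechanical Turk request; the exact tie-handling of the real corpus construction may differ:

```python
def review_until_agreement(get_votes, max_rounds=8, majority=2/3):
    """Request 3 annotator votes per round (True = vandalism) until
    one label reaches a 2/3 majority of all votes so far, or until
    max_rounds is exhausted, in which case the edit is a tie."""
    votes = []
    for _ in range(max_rounds):
        votes.extend(get_votes(3))
        ratio = sum(votes) / len(votes)
        if ratio >= majority:
            return "vandalism", votes
        if (1 - ratio) >= majority:
            return "regular", votes
    return "tie", votes

# unanimous annotators settle the label in the first round
label, votes = review_until_agreement(lambda n: [True] * n)
```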
94 The PAN Competition Continued Vandalism Detection Results 113 [ ] c
95 The PAN Competition Continued Vandalism Detection Results 117 [ ] c
96 The PAN Competition Continued Vandalism Detection Results 122 [ ] c
97 The PAN Competition Continued Vandalism Detection Results 123 [ ] c
More informationInformation technology Learning, education and training Collaborative technology Collaborative learning communication. Part 1:
Provläsningsexemplar / Preview INTERNATIONAL STANDARD ISO/IEC 19780-1 Second edition 2015-10-15 Information technology Learning, education and training Collaborative technology Collaborative learning communication
More informationKnowledge Engineering with Semantic Web Technologies
This file is licensed under the Creative Commons Attribution-NonCommercial 3.0 (CC BY-NC 3.0) Knowledge Engineering with Semantic Web Technologies Lecture 5: Ontological Engineering 5.3 Ontology Learning
More informationEDITING & PROOFREADING CHECKLIST
EDITING & PROOFREADING CHECKLIST TABLE OF CONTENTS 1. Conduct a First Pass... 2 1.1. Ensure effective organization... 2 1.2. Check the flow and tone... 3 1.3. Check for correct mechanics... 4 1.4. Ensure
More informationCADIAL Search Engine at INEX
CADIAL Search Engine at INEX Jure Mijić 1, Marie-Francine Moens 2, and Bojana Dalbelo Bašić 1 1 Faculty of Electrical Engineering and Computing, University of Zagreb, Unska 3, 10000 Zagreb, Croatia {jure.mijic,bojana.dalbelo}@fer.hr
More informationKnowledge Retrieval. Franz J. Kurfess. Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A.
Knowledge Retrieval Franz J. Kurfess Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A. 1 Acknowledgements This lecture series has been sponsored by the European
More informationSYDE Winter 2011 Introduction to Pattern Recognition. Clustering
SYDE 372 - Winter 2011 Introduction to Pattern Recognition Clustering Alexander Wong Department of Systems Design Engineering University of Waterloo Outline 1 2 3 4 5 All the approaches we have learned
More informationRPI INSIDE DEEPQA INTRODUCTION QUESTION ANALYSIS 11/26/2013. Watson is. IBM Watson. Inside Watson RPI WATSON RPI WATSON ??? ??? ???
@ INSIDE DEEPQA Managing complex unstructured data with UIMA Simon Ellis INTRODUCTION 22 nd November, 2013 WAT SON TECHNOLOGIES AND OPEN ARCHIT ECT URE QUEST ION ANSWERING PROFESSOR JIM HENDLER S IMON
More informationCHAPTER 2: LITERATURE REVIEW
CHAPTER 2: LITERATURE REVIEW 2.1. LITERATURE REVIEW Wan-Yu Lin [31] introduced a framework that identifies online plagiarism by exploiting lexical, syntactic and semantic features that includes duplication-gram,
More informationTechnical Paper Style Guide
AACE International Technical Paper Style Guide Prepared by the AACE International Technical Board Revised February 3, 2017 Contents 1. Purpose... 3 2. General Requirements... 3 2.1. Authorship... 3 2.2.
More informationThe Application of Closed Frequent Subtrees to Authorship Attribution
The Application of Closed Frequent Subtrees to Authorship Attribution Jonas Lindh Morén January 24, 2014 Bachelor s Thesis in Computing Science, 15 credits Supervisor at CS-UmU: Johanna Björklund Examiner:
More informationGRADES LANGUAGE! Live, Grades Correlated to the Oklahoma College- and Career-Ready English Language Arts Standards
GRADES 4 10 LANGUAGE! Live, Grades 4 10 Correlated to the Oklahoma College- and Career-Ready English Language Arts Standards GRADE 4 Standard 1: Speaking and Listening Students will speak and listen effectively
More informationA Content Based Image Retrieval System Based on Color Features
A Content Based Image Retrieval System Based on Features Irena Valova, University of Rousse Angel Kanchev, Department of Computer Systems and Technologies, Rousse, Bulgaria, Irena@ecs.ru.acad.bg Boris
More informationBuilding and Annotating Corpora of Collaborative Authoring in Wikipedia
Building and Annotating Corpora of Collaborative Authoring in Wikipedia Johannes Daxenberger, Oliver Ferschke and Iryna Gurevych Workshop: Building Corpora of Computer-Mediated Communication: Issues, Challenges,
More informationAs well as being faster to read, plain English sounds friendlier and more in keeping with the tone people expect from digital communications.
PLAIN ENGLISH Use this plain English checklist to audit your writing Plain English explained It won t surprise you that complex text slows down comprehension. But it might surprise you to learn that the
More informationEnhanced retrieval using semantic technologies:
Enhanced retrieval using semantic technologies: Ontology based retrieval as a new search paradigm? - Considerations based on new projects at the Bavarian State Library Dr. Berthold Gillitzer 28. Mai 2008
More informationClustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York
Clustering Robert M. Haralick Computer Science, Graduate Center City University of New York Outline K-means 1 K-means 2 3 4 5 Clustering K-means The purpose of clustering is to determine the similarity
More informationMath 3820 Project. 1 Typeset or handwritten? Guidelines
Math 3820 Project Guidelines Abstract These are some recommendations concerning the projects in Math 3820. 1 Typeset or handwritten? Although handwritten reports will be accepted, I strongly recommended
More informationAssessing the Quality of Natural Language Text
Assessing the Quality of Natural Language Text DC Research Ulm (RIC/AM) daniel.sonntag@dfki.de GI 2004 Agenda Introduction and Background to Text Quality Text Quality Dimensions Intrinsic Text Quality,
More informationAn Overview of Plagiarism Recognition Techniques. Xie Ruoyun. Received June 2018; revised July 2018
International Journal of Knowledge www.ijklp.org and Language Processing KLP International c2018 ISSN 2191-2734 Volume 9, Number 2, 2018 pp.1-19 An Overview of Plagiarism Recognition Techniques Xie Ruoyun
More informationGovernment of Karnataka Department of Technical Education Bengaluru
CIE- 25 Marks Government of Karnataka Department of Technical Education Bengaluru Course Title: Basic Web Design Lab Scheme (L:T:P) : 0:2:4 Total Contact Hours: 78 Type of Course: Tutorial and Practical
More informationInformation Retrieval and Web Search
Information Retrieval and Web Search Introduction to IR models and methods Rada Mihalcea (Some of the slides in this slide set come from IR courses taught at UT Austin and Stanford) Information Retrieval
More informationNatural Language Processing SoSe Question Answering. (based on the slides of Dr. Saeedeh Momtazi)
Natural Language Processing SoSe 2015 Question Answering Dr. Mariana Neves July 6th, 2015 (based on the slides of Dr. Saeedeh Momtazi) Outline 2 Introduction History QA Architecture Outline 3 Introduction
More informationA MULTI-DIMENSIONAL DATA ORGANIZATION THAT ASSISTS IN THE PARSING AND PRODUCTION OF A SENTENCE
A MULTI-DIMENSIONAL DATA ORGANIZATION THAT ASSISTS IN THE PARSING AND PRODUCTION OF A SENTENCE W. Faris and K. Cheng Department of Computer Science University of Houston Houston, TX, 77204, USA http://www.cs.uh.edu
More informationInformation Retrieval CS Lecture 01. Razvan C. Bunescu School of Electrical Engineering and Computer Science
Information Retrieval CS 6900 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Information Retrieval Information Retrieval (IR) is finding material of an unstructured
More informationQuery Difficulty Prediction for Contextual Image Retrieval
Query Difficulty Prediction for Contextual Image Retrieval Xing Xing 1, Yi Zhang 1, and Mei Han 2 1 School of Engineering, UC Santa Cruz, Santa Cruz, CA 95064 2 Google Inc., Mountain View, CA 94043 Abstract.
More informationSemantically Driven Snippet Selection for Supporting Focused Web Searches
Semantically Driven Snippet Selection for Supporting Focused Web Searches IRAKLIS VARLAMIS Harokopio University of Athens Department of Informatics and Telematics, 89, Harokopou Street, 176 71, Athens,
More informationA Novel Categorized Search Strategy using Distributional Clustering Neenu Joseph. M 1, Sudheep Elayidom 2
A Novel Categorized Search Strategy using Distributional Clustering Neenu Joseph. M 1, Sudheep Elayidom 2 1 Student, M.E., (Computer science and Engineering) in M.G University, India, 2 Associate Professor
More informationME 4054W: SENIOR DESIGN PROJECTS
ME 4054W: SENIOR DESIGN PROJECTS Week 3 Thursday Documenting Your Design Before we get started We have received feedback from an industry advisor that some of the students on their design team were not
More informationWikipedia Vandalism Detection Through Machine Learning: Feature Review and New Proposals
Wikipedia Vandalism Detection Through Machine Learning: Feature Review and New Proposals Lab Report for PAN at CLEF 2010 Santiago M. Mola Velasco Private coldwind@coldwind.org Abstract Wikipedia is an
More informationDetailed Format Instructions for Authors of the SPB Encyclopedia
Detailed Format Instructions for Authors of the SPB Encyclopedia General Formatting: When preparing the manuscript, the author should limit the use of control characters or special formatting. Use italics
More informationNatural Language Processing. SoSe Question Answering
Natural Language Processing SoSe 2017 Question Answering Dr. Mariana Neves July 5th, 2017 Motivation Find small segments of text which answer users questions (http://start.csail.mit.edu/) 2 3 Motivation
More informationNatural Language Processing Is No Free Lunch
Natural Language Processing Is No Free Lunch STEFAN WAGNER UNIVERSITY OF STUTTGART, STUTTGART, GERMANY ntroduction o Impressive progress in NLP: OS with personal assistants like Siri or Cortan o Brief
More information10701 Machine Learning. Clustering
171 Machine Learning Clustering What is Clustering? Organizing data into clusters such that there is high intra-cluster similarity low inter-cluster similarity Informally, finding natural groupings among
More informationStudent Performance Analysis Report. Subtest / Content Strand. Verbal Reasoning 40 Questions Results by Question # of Q
Student: BLAIR, SIERRA Test Date: 9/2018 (Fall) Grade: 7 Level: 6 Page: 1 Verbal Reasoning 40 Questions Results by Question # of Q Analogical Reasoning (Results presented in order of question difficulty)
More information