Detection of Text Plagiarism and Wikipedia Vandalism
1 Detection of Text Plagiarism and Wikipedia Vandalism Benno Stein Bauhaus-Universität Weimar Keynote at SEPLN, Valencia, 8. Sep. 2010
2 The webis Group: Benno Stein, Dennis Hoppe, Maik Anderka, Martin Potthast, Matthias Hagen, Nedim Lipka, Peter Prettenhofer, Tim Gollub, Christin Gläser
5 Outline: External Plagiarism Detection; The PAN Competition; Intrinsic Plagiarism Detection; Vandalism Detection in Wikipedia; The PAN Competition Continued
6 Plagiarism is the practice of claiming, or implying, original authorship of someone else's written or creative work, in whole or in part, into one's own without adequate acknowledgment. [Wikipedia: Plagiarism, 2009]
7 ... better technology nowadays ;)
9 Is plagiarism a problem with respect to education?
Is there a misunderstanding w.r.t. an evolving cultural technique? (a service that exploits the unacknowledged wisdom of the crowd.)
Can plagiarism be detected by humans?
Can plagiarism be detected by machines?
Should automatic plagiarism detection algorithms become standard?
For several reasons we should say text reuse rather than plagiarism.
12 External Plagiarism Detection
13 External Plagiarism Detection: How Humans Spot Plagiarism
+ Exploits human intuition for peculiar passages.
+ Exploits human experience to analyze the search engine results.
+ Is applied easily and in an ad-hoc manner.
- Cannot be done on a large scale.
- Depends on (commercial) third-party services.
- Fails in the case of obfuscated / modified text.
- Cannot be used to find the reuse of structure or argumentation lines.
17 External Plagiarism Detection, Algorithms for Machines. The pipeline: Step 1 Keyword Extraction from Document; Step 2 Heuristic Search in the WWW; Step 3 Document Selection; Step 4 Detailed Analysis; Step 5 Knowledge-based Post-processing.
19 Step 1, Keyword Extraction from Document. Where are the crucial keywords? Check for noun phrases. Find orthographic mistakes. Consider word frequency classes. But don't look in titles, captions, or headings.
21 External Plagiarism Detection, Algorithms for Machines: Heuristic Search in the WWW (Step 2). Query keywords: information retrieval, query formulation, search session, user support.
25 External Plagiarism Detection, Algorithms for Machines: the Query Cover Problem (Step 2 continued). Given: 1. a set W of keywords; 2. a query interface for a Web search engine S; 3. an upper bound k on the result list length. Task: find a family of queries Q covering W, yielding at most k Web results each. [User over Ranking]
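The query cover task above can be sketched as a greedy procedure. Note this is an illustration, not the algorithm from the talk: `search_result_count` is a hypothetical stand-in for a real search engine interface, and the growth heuristic (add keywords until the query is specific enough) is an assumption.

```python
# Greedy sketch of the query cover problem: cover the keyword set W
# with queries that each yield at most k Web results.
# `search_result_count(query)` is a hypothetical search interface.

def greedy_query_cover(W, search_result_count, k=100, max_query_len=4):
    """Return a list of query strings that together cover W (sketch)."""
    uncovered = set(W)
    queries = []
    while uncovered:
        query = []
        # Grow the query until it is specific enough (<= k results)
        # or the maximum query length is reached.
        for word in sorted(uncovered):
            candidate = query + [word]
            if len(candidate) > max_query_len:
                break
            query = candidate
            if search_result_count(" ".join(query)) <= k:
                break
        queries.append(" ".join(query))
        uncovered -= set(query)
    return queries
```

Each round removes at least one keyword from the uncovered set, so the procedure terminates even when no query reaches the result bound.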
28 External Plagiarism Detection, Algorithms for Machines: Detailed Analysis (Step 4). Technologies, from basic to complex, and what each can detect:
MD5 hashing: identity analysis for paragraphs.
Hashed breakpoint chunking: synchronized identity analysis for paragraphs.
Fuzzy-fingerprinting: tolerant similarity analysis for paragraphs.
Dot plotting: sequences of word n-grams.
29 External Plagiarism Detection, Algorithms for Machines: Pairwise Comparison. 1. Partition each document into meaningful sections, also called chunks. 2. Do a pairwise comparison using a similarity function ϕ. [Slide illustrates the scheme on an example paper, "On Web-based Plagiarism Analysis" (Kleppe, Braunsdorf, Loessnitz, Meyer zu Eissen; Bauhaus University Weimar).] Complexity: n documents in the corpus, c chunks per document on average, O(n c^2) comparisons.
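The pairwise comparison scheme can be sketched as follows. The fixed word-window chunking and cosine similarity as ϕ are illustrative choices, not the specific functions used in the talk:

```python
# Sketch of pairwise chunk comparison: split documents into chunks,
# represent each chunk as a bag of words, and compare every suspicious
# chunk against every corpus chunk with a similarity function phi.
from collections import Counter
import math

def chunks(text, size=50):
    words = text.lower().split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def phi(a, b):
    """Cosine similarity between two chunks (one possible phi)."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def find_matches(suspicious, corpus, threshold=0.8):
    # O(n * c^2) comparisons, as stated on the slide.
    matches = []
    for s in chunks(suspicious):
        for doc_id, doc in enumerate(corpus):
            for c in chunks(doc):
                if phi(s, c) >= threshold:
                    matches.append((doc_id, s, c))
    return matches
```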
30 External Plagiarism Detection, Algorithms for Machines: MD5 Hashing. 1. Partition each document into equidistant sections. 2. Compute hash values of the chunks using a hash function h. 3. Put all hashes into a hash table; a collision indicates matching chunks. [Slide shows the example document with chunk hashes such as h=9154 and h=2232.] Complexity: n documents in the corpus, c chunks per document on average, O(n c) operations (fingerprint generation, hash table operations).
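A minimal sketch of the MD5 fingerprinting step above; the chunk size and the word-level normalization are assumptions for illustration:

```python
# Sketch of MD5-based fingerprinting: hash equidistant chunks and use
# hash-table collisions to flag identical chunks in O(n * c) operations.
import hashlib
from collections import defaultdict

def md5_chunks(text, size=8):
    """Yield (md5 digest, chunk) pairs for equidistant word chunks."""
    words = text.lower().split()
    for i in range(0, len(words), size):
        chunk = " ".join(words[i:i + size])
        yield hashlib.md5(chunk.encode()).hexdigest(), chunk

def collisions(suspicious, corpus):
    table = defaultdict(list)
    for doc_id, doc in enumerate(corpus):
        for h, chunk in md5_chunks(doc):
            table[h].append((doc_id, chunk))
    # A collision indicates a chunk copied verbatim.
    return [(h, table[h]) for h, _ in md5_chunks(suspicious) if h in table]
```

The weakness noted on the next slides follows directly: changing a single word in a chunk changes its MD5 value, so nothing is detected.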
31 External Plagiarism Detection, Algorithms for Machines: Hashed Breakpoint Chunking. 1. Partition each document into synchronized sections. 2. Compute hash values of the chunks using a hash function h. 3. Put all hashes into a hash table; a collision indicates matching chunks. [Slide shows the example document with chunk hashes such as h=3294 and h=7439.] Complexity: n documents in the corpus, c chunks per document on average, O(n c) operations (fingerprint generation, hash table operations).
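One common way to realize such synchronized (content-defined) chunking is to place a chunk boundary after every word whose hash is 0 modulo k. This sketch illustrates the idea, not the exact scheme referenced on the slide:

```python
# Sketch of hashed breakpoint chunking: a boundary is placed after every
# word whose hash is 0 mod k, so chunk boundaries re-synchronize after
# insertions or deletions (unlike fixed equidistant chunks).
import hashlib

def breakpoint_chunks(text, k=4):
    chunks, current = [], []
    for word in text.lower().split():
        current.append(word)
        digest = int(hashlib.md5(word.encode()).hexdigest(), 16)
        if digest % k == 0:          # breakpoint word closes the chunk
            chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because boundaries depend only on the words themselves, an unchanged passage produces the same chunks (and hence the same fingerprints) regardless of what was inserted before it.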
32 External Plagiarism Detection, Algorithms for Machines: Fuzzy-Fingerprinting. Standard hashing: equal chunks yield the same hash key, h(c1) = h(c2) if c1, c2 are equal. Problem: sensitive to the smallest changes. Fuzzy-fingerprinting: different but similar chunks yield the same hash key, h_ϕ(c1) = h_ϕ(c2) if c1, c2 are similar, with high probability. Approach: abstraction by reducing the alphabet and neglecting word order. Problem: similarity-sensitive hash functions suffer from a low recall.
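A toy similarity-sensitive hash in the spirit described above, reducing the alphabet (here: word prefixes) and neglecting word order. The prefix reduction is an illustrative assumption, not the original fuzzy-fingerprinting construction:

```python
# Sketch of a similarity-sensitive hash: map a chunk to the sorted set
# of its word prefixes, so reordered or lightly edited chunks collapse
# onto the same key, at the cost of some recall and precision.

def fuzzy_key(chunk, prefix_len=4):
    words = chunk.lower().split()
    # Reduce the alphabet: keep only word prefixes, drop order and counts.
    reduced = sorted({w[:prefix_len] for w in words})
    return hash(tuple(reduced))

# Reordered chunks with changed word endings get equal keys:
a = "the algorithm detects plagiarized passages"
b = "plagiarized passages the algorithm detection"
assert fuzzy_key(a) == fuzzy_key(b)
```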
33 External Plagiarism Detection, Algorithms for Machines: Dot Plotting. Geometric sequence analysis of all word 4-grams of two interesting documents.
36 Dot plotting, level 1 (black): each dot indicates a common word 4-gram (hash collision).
37 Level 2 (blue): neighboring common 4-grams are heuristically grouped.
38 Level 3 (red): blue groups are merged by a cluster analysis (DBSCAN).
39 The involved text of a cluster forms a plagiarism candidate.
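The level-1 dot plot can be sketched directly from shared word 4-grams; the grouping and clustering of levels 2 and 3 are omitted here:

```python
# Sketch of dot-plot construction: a dot at (i, j) marks a word 4-gram
# that occurs at position i in document A and position j in document B.
# Diagonal runs of dots indicate contiguous reused passages.

def ngrams(text, n=4):
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def dot_plot(doc_a, doc_b, n=4):
    positions_b = {}
    for j, g in enumerate(ngrams(doc_b, n)):
        positions_b.setdefault(g, []).append(j)
    dots = []
    for i, g in enumerate(ngrams(doc_a, n)):
        for j in positions_b.get(g, []):
            dots.append((i, j))
    return dots
```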
41 External Plagiarism Detection, Algorithms for Machines: Knowledge-based Post-processing (Step 5). Check for problematic decisions: citation analysis (which can itself be problematic: consider an excuse citation in a footnote along with a completely reused text), and comparison of authors and co-authors.
42 How to overcome the language barrier: machine translation services; mapping into a concept space (ESA, CL-ESA).
43 The PAN Competition
44 The 2nd International Competition on Plagiarism Detection, PAN 2010. Facts: organized as a CLEF 2010 Lab; 18 groups from 12 countries participated; 15 weeks of training and testing (March to June); the training corpus was PAN-PC-09; the test corpus was PAN-PC-10, a new version of last year's corpus; incidentally, the 1st competition was held at SEPLN 09. Task: given a set of suspicious documents and a set of source documents, find all plagiarized sections in the suspicious documents and, if available, the corresponding source sections.
45 The PAN Competition: Plagiarism Corpus PAN-PC-10. A large-scale resource for the controlled evaluation of detection algorithms: documents obtained from books from Project Gutenberg, with about 0-10 plagiarism cases per document. PAN-PC-10 addresses a broad range of plagiarism situations by varying reasonably within the following parameters: 1. document length; 2. document language; 3. detection task; 4. plagiarism case length; 5. plagiarism case obfuscation; 6. plagiarism case topic alignment.
46 The PAN Competition: PAN-PC-10 Document Statistics. Document length: 50% short (1-10 pages), 35% medium, 15% long. Document language: 80% English, 10% German (de), 10% Spanish (es). Detection task: 70% external analysis, 30% intrinsic analysis. [Slide also plots the plagiarism fraction per document in percent, split into plagiarized and unmodified text.]
47 The PAN Competition: PAN-PC-10 Plagiarism Case Statistics. Plagiarism case length: 34% short, 33% medium, 33% long. Plagiarism case obfuscation: 40% none, 40% artificial [3], 6% simulated [4], 14% cross-language [5], ranging from low to high obfuscation. [3] Artificial plagiarism: algorithmic obfuscation. [4] Simulated plagiarism: obfuscation via Amazon Mechanical Turk (AMT). [5] Cross-language plagiarism: obfuscation due to machine translation de-en and es-en. Plagiarism case topic alignment: 50% intra-topic, 50% inter-topic.
48 The PAN Competition: Plagiarism Detection Results, ranked by plagdet: Kasprzak, Zou, Muhr, Grozea, Oberreuter, Torrejón, Pereira, Palkovskii, Sobha, Gottron, Micol, Costa-jussà, Nawab, Gupta, Vania, Suàrez, Alzahrani, Iftene. Plagdet combines precision, recall, and granularity. Precision and recall are well-known, yet not well-defined; granularity measures the number of times a single plagiarism case has been detected. [Potthast et al., 2010]
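The plagdet combination can be sketched as follows, assuming the definition from Potthast et al. (2010): the harmonic mean of precision and recall, discounted by the logarithm of the granularity:

```python
# Sketch of the plagdet score: F1 of precision and recall, divided by
# log2(1 + granularity). Granularity >= 1 counts how often a single
# plagiarism case is detected in pieces, so fragmented detections are
# penalized while granularity 1 leaves F1 unchanged.
import math

def plagdet(precision, recall, granularity):
    if precision + recall == 0:
        return 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return f1 / math.log2(1 + granularity)
```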
49 The PAN Competition: Plagiarism Detection Results, broken down into recall, precision, and granularity for the same participants: Kasprzak, Zou, Muhr, Grozea, Oberreuter, Torrejón, Pereira, Palkovskii, Sobha, Gottron, Micol, Costa-jussà, Nawab, Gupta, Vania, Suàrez, Alzahrani, Iftene.
50 Intrinsic Plagiarism Detection
51 How Humans Spot Plagiarism, when no corpus along with a powerful search engine is at hand:
look for style changes
check for peculiarities (orthographic mistakes, typographical habits)
listen to the instincts (perhaps the most powerful technology)
53 Intrinsic Plagiarism Detection, Algorithms for Machines. The pipeline: Step 1 Impurity Assessment; Step 2 Chunking Strategy; Step 3 Style Model Construction; Step 4 Outlier Identification; Step 5 Outlier Post-processing.
55 Step 1, Impurity Assessment. How large is the fraction θ of plagiarized text? Document length analysis; genre analysis (e.g. scientific article versus editorial); analysis of the issuing institution.
57 Step 2, Chunking Strategy. How to find text positions where plagiarism starts or ends? From basic to complex: uniform length chunking (simple but naive); structural boundaries (chapters, paragraphs, tables, captions); topical boundaries (difficult but powerful); stylistic boundaries (best, but usually intractable).
59 Intrinsic Plagiarism Detection, Algorithms for Machines: Style Model Construction (Step 3). The question of stylometry: how to quantify writing style? From basic to complex: structural features (paragraph lengths, use of tables, signatures); character-based lexical features (n-gram frequency, compression rate); word-based lexical features (readability, writing complexity); syntactic features (part of speech, function words); dialectic power and argumentation.
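A minimal style model over two of the word-based lexical features named above (average word length and average sentence length); the feature choice and tokenization are illustrative assumptions:

```python
# Sketch of a chunk-level style vector: map a text chunk to simple
# word-based lexical features. In a full system many such features
# (readability scores, frequency classes, POS n-grams) are combined.
import re

def style_vector(chunk):
    sentences = [s for s in re.split(r"[.!?]+", chunk) if s.strip()]
    words = re.findall(r"[A-Za-z]+", chunk)
    if not words or not sentences:
        return (0.0, 0.0)
    avg_word_len = sum(len(w) for w in words) / len(words)
    avg_sentence_len = len(words) / len(sentences)
    return (avg_word_len, avg_sentence_len)
```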
60 Intrinsic Plagiarism Detection, Algorithms for Machines: Style Features that Work. Stylometric features, ranked by F-measure: Flesch Reading Ease Score; average number of syllables per word; frequency of the term "of"; Noun-Verb-Noun tri-gram; Noun-Noun-Verb tri-gram; Verb-Noun-Noun tri-gram; Gunning Fog index; Yule's K measure; Flesch-Kincaid grade level; average word length; Noun-Preposition-ProperNoun tri-gram; Honore's R measure; average word frequency class; Consonant-Vowel-Consonant tri-gram; frequency of the term "is"; Noun-Noun-CoordinatingConjunction tri-gram; NounPlural-Preposition-Determiner tri-gram; Determiner-NounPlural-Preposition tri-gram; Consonant-Vowel-Vowel tri-gram.
61 Intrinsic Plagiarism Detection Algorithms for Machines Impurity Assessment Chunking Strategy Style Model Construction Outlier Identification Outlier Post-processing Step 1 Step 2 Step 3 Step 4 Step 5 74 [ ] c
62 Intrinsic Plagiarism Detection Algorithms for Machines
Impurity Assessment (Step 1), Chunking Strategy (Step 2), Style Model Construction (Step 3), Outlier Identification (Step 4), Outlier Post-processing (Step 5)
One-class classification: we have an idea about positive examples only. :(
- Density methods try to model a style feature's distribution.
- Boundary methods try to cluster text portions of similar style.
- Reconstruction methods quantify the style generation error under the average style model.
75 [ ] c
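A density method in its simplest form models one style feature (say, average word length per chunk) by its mean and standard deviation over the whole document and flags chunks far from that bulk. The z-score sketch below illustrates the idea; the threshold is an assumption, not a value from the talk:

```python
import statistics

def density_outliers(feature_values, z_threshold=2.0):
    """One-class density sketch: estimate the feature's distribution
    from ALL chunks (only 'positive', average-style examples exist)
    and flag chunks deviating by more than z_threshold standard
    deviations from the mean."""
    mean = statistics.mean(feature_values)
    sd = statistics.pstdev(feature_values)
    if sd == 0:
        return []
    return [i for i, v in enumerate(feature_values)
            if abs(v - mean) / sd > z_threshold]

# average word length per chunk; chunk 9 deviates strongly
outliers = density_outliers(
    [4.1, 4.3, 4.0, 4.2, 4.1, 4.4, 4.0, 4.2, 4.3, 7.5])
```

Note the one-class caveat: the outliers themselves contaminate the estimated mean and deviation, which is why robust estimates or the Bayesian treatment on the next slides are preferable.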
63 Intrinsic Plagiarism Detection Algorithms for Machines: Outlier Identification

[Screenshot of a paper page, with the style vectors of chunks overlaid:]
Alexander Kleppe, Dennis Braunsdorf, Christoph Loessnitz, Sven Meyer zu Eissen. Bauhaus University Weimar, Faculty of Media / Media Systems, Weimar, Germany.
Abstract: The paper in hand presents a Web-based application for the analysis of text documents with respect to plagiarism. Aside from reporting experiences with standard algorithms, a new method for plagiarism analysis is introduced. Since well-known algorithms for plagiarism detection assume the existence of a candidate document collection against which a suspicious document can be compared, they are unsuited to spot potentially copied passages using only the input document. This kind of plagiarism remains undetected, e.g., when paragraphs are copied from sources that are not available electronically. Our method is able to detect a change in writing style, and consequently to identify suspicious passages within a single document. Apart from contributing to solve the outlined problem, the presented method can also be used to focus a search for potentially original documents.
Key words: plagiarism analysis, style analysis, focused search, chunking, Kullback-Leibler divergence
1 Introduction: Plagiarism refers to the use of another's ideas, information, language, or writing, when done without proper acknowledgment of the original source [15]. Recently, the growing amount of digitally available documents contributes to the possibility to easily find and (partially) copy text documents given a specific topic: according to McCabe's plagiarism study on 18,000 students, about 50% of the students admit to plagiarize from Internet documents [7].
1.1 Plagiarism Forms: Plagiarism happens in several forms. Heintze distinguishes between the following textual relationships between documents: identical copy, edited copy, reorganized document, revisioned document, condensed/expanded document, documents that include portions of other documents. Moreover, unauthorized (partial) translations and documents that copy the structure of other documents can also be seen as plagiarized. Figure 1 depicts a taxonomy of plagiarism forms. Orthogonal to plagiarism forms are the underlying media: plagiarism may happen in articles, books, or computer programs.

style vectors of chunks 80 [ ] c
64 Intrinsic Plagiarism Detection Algorithms for Machines: Outlier Identification
[Same paper-page screenshot as on slide 63; style vectors of chunks plotted along a style-feature axis.]
Assume that style features of outliers (plagiarized text) are uniformly distributed. 81 [ ] c
65 Intrinsic Plagiarism Detection Algorithms for Machines: Outlier Identification
[Same paper-page screenshot as on slide 63; style vectors of chunks plotted along a style-feature axis.]
Assume that style features of original text are Gaussian distributed. 82 [ ] c
66 Intrinsic Plagiarism Detection Algorithms for Machines: Outlier Identification
[Same paper-page screenshot as on slide 63; the style-feature axis is partitioned into H1 | uncertainty interval | H0 | uncertainty interval | H1.]
Compute the maximum a-posteriori hypothesis under Naive Bayes. 83 [ ] c
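The decision on slides 64-66 can be sketched as a per-feature Naive Bayes MAP test: a Gaussian likelihood for original text, a uniform likelihood for plagiarized text, and the larger posterior wins. The prior and all parameter values below are illustrative assumptions:

```python
import math

def map_outlier(x, mean, sd, lo, hi, prior_plag=0.07):
    """Per-feature MAP decision: original text is modeled as
    Gaussian(mean, sd), plagiarized text as Uniform(lo, hi);
    return the hypothesis with the larger posterior. Under Naive
    Bayes, several features would multiply their likelihoods."""
    gauss = (math.exp(-0.5 * ((x - mean) / sd) ** 2)
             / (sd * math.sqrt(2 * math.pi)))
    p_original = (1 - prior_plag) * gauss
    p_plagiarized = prior_plag * (1.0 / (hi - lo) if lo <= x <= hi else 0.0)
    return "plagiarized" if p_plagiarized > p_original else "original"

# a chunk whose average word length matches the document's style model
a = map_outlier(4.2, mean=4.2, sd=0.2, lo=0.0, hi=10.0)   # "original"
# a chunk far out in the Gaussian tail falls to the uniform hypothesis
b = map_outlier(8.0, mean=4.2, sd=0.2, lo=0.0, hi=10.0)   # "plagiarized"
```

Values near the crossover of the two densities land in the uncertainty intervals of slide 66, where neither hypothesis clearly dominates.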
67 Intrinsic Plagiarism Detection Algorithms for Machines
Impurity Assessment (Step 1), Chunking Strategy (Step 2), Style Model Construction (Step 3), Outlier Identification (Step 4), Outlier Post-processing (Step 5)
84 [ ] c
68 Intrinsic Plagiarism Detection Algorithms for Machines
Impurity Assessment (Step 1), Chunking Strategy (Step 2), Style Model Construction (Step 3), Outlier Identification (Step 4), Outlier Post-processing (Step 5)
Since we are still unsure... How to obtain additional evidence about authorship?
- Raise precision at the expense of recall (analyze the ROC characteristic).
- If sufficient text is available, apply authorship verification technology (unmasking).
85 [ ] c
69 Vandalism Detection in Wikipedia 86 [ ] c
70 Vandalism Detection in Wikipedia Example: size, capitalization, punctuation, word existence 87 [ ] c
71 Vandalism Detection in Wikipedia Example: size, capitalization, punctuation, word existence 88 [ ] c
72 Vandalism Detection in Wikipedia Example: size, capitalization, punctuation, repetition 89 [ ] c
73 Vandalism Detection in Wikipedia Example: size, capitalization, punctuation, repetition 90 [ ] c
74 Vandalism Detection in Wikipedia Example: vulgarism, sentiment 91 [ ] c
75 Vandalism Detection in Wikipedia Example: vulgarism, sentiment 92 [ ] c
76 Vandalism Detection in Wikipedia Example: special chars, spacing 93 [ ] c
77 Vandalism Detection in Wikipedia Example: special chars, spacing 94 [ ] c
78 Vandalism Detection in Wikipedia Example: misguided helping 95 [ ] c
79 Vandalism Detection in Wikipedia Example: misguided helping 96 [ ] c
80 Vandalism Detection in Wikipedia Example: wrong facts, defamation 97 [ ] c
81 Vandalism Detection in Wikipedia Example: wrong facts, defamation 98 [ ] c
82 Vandalism Detection in Wikipedia Example: opinionated 99 [ ] c
83 Vandalism Detection in Wikipedia Example: opinionated 100 [ ] c
84 Vandalism Detection in Wikipedia Example: wrong facts, nonsense 101 [ ] c
85 Vandalism Detection in Wikipedia Example: wrong facts, nonsense 102 [ ] c
86 Vandalism Detection in Wikipedia Example: wrong facts, article history may suggest otherwise 103 [ ] c
87 Vandalism Detection in Wikipedia: The Machine Learning Perspective
The achievements of ML unfold their full power in discrimination situations. The tasks of intrinsic plagiarism analysis, authorship verification, and vandalism detection share a particular characteristic: they are one-class classification problems. Feature engineering plays an outstanding role. 105 [ ] c
88 Vandalism Detection in Wikipedia Two Types of Edit Features 106 [ ] c
89 Vandalism Detection in Wikipedia: Two Types of Edit Features: Content-based

Character-level features:
- Capitalization: ratio of upper-case chars to lower-case chars (all chars)
- Distribution: Kullback-Leibler divergence of the char distribution from the expectation
- Compressibility: compression rate of the edit differences
- Markup: ratio of new (changed) wikitext chars to all wikitext chars

Word-level features:
- Vulgarism: frequency of vulgar words
- Pronouns: frequency of personal pronouns
- Sentiment: frequency of sentiment words

Spelling and grammar features:
- Word existence: ratio of words that occur in an English dictionary
- Spelling: frequency (impact) of spelling errors
- Grammar: number of grammatical errors

Edit type features:
- Edit type: the edit is an insertion, deletion, modification, or a combination
- Replacement: the article (a paragraph) is completely replaced, excluding its title

107 [ ] c
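A few of these content-based features are computable in a couple of lines each on the text an edit inserts. The sketch below covers capitalization ratio, compressibility (repeated characters, common in vandalism, compress well), and vulgarism frequency; the vulgar-word list is a tame illustrative stand-in:

```python
import zlib

VULGAR_WORDS = {"idiot", "stupid", "dumb"}  # illustrative stand-in list

def content_features(inserted_text):
    """Compute three content-based edit features on inserted text:
    capitalization (upper/lower ratio), compressibility (zlib
    compression rate), and vulgarism (vulgar-word frequency)."""
    upper = sum(c.isupper() for c in inserted_text)
    lower = sum(c.islower() for c in inserted_text)
    data = inserted_text.encode("utf-8")
    rate = len(zlib.compress(data)) / len(data) if data else 1.0
    words = inserted_text.lower().split()
    vulgar = sum(w.strip(".,!?") in VULGAR_WORDS for w in words)
    return {
        "capitalization": upper / max(1, lower),
        "compressibility": rate,
        "vulgarism": vulgar / max(1, len(words)),
    }

f = content_features("WIKIPEDIA IS STUPID!!!!!!!!!!!!!!!!!!")
```

An all-caps, vulgar, highly repetitive edit scores high on all three features, which is exactly the signal a downstream classifier exploits.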
90 Vandalism Detection in Wikipedia: Two Types of Edit Features: Context-based

Edit comment features:
- Existence: a comment was given
- Length: length of the comment

Edit time features:
- Edit time: hour of the day the edit was made
- Successiveness: logarithm of the time difference to the previous edit

Article revision history features:
- Revisions: number of revisions
- Regular: number of regular edits

Article trustworthiness features:
- Suspect topic: the article is on the list of often-vandalized articles
- WikiTrust: values from the WikiTrust trust histogram

Editor reputation features:
- Anonymous: anonymous editor
- Reputation: scores that compute a user's reputation based on previous edits
- Registration: time the editor was registered with Wikipedia

108 [ ] c
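Context-based features come from edit metadata rather than from the edited text. The sketch below computes a few of them from a hypothetical edit record; the dict keys ('comment', 'timestamp', 'prev_timestamp', 'anonymous') are assumptions for illustration, not a real Wikipedia API schema:

```python
import math

def context_features(edit):
    """Compute comment, time, and editor features from edit metadata.
    `edit` is a dict with hypothetical keys: 'comment' (str or None),
    'timestamp' and 'prev_timestamp' (seconds), 'anonymous' (bool)."""
    comment = edit.get("comment") or ""
    # Successiveness: log of the time gap to the previous edit
    # (vandalism often arrives in rapid succession).
    dt = max(1.0, edit["timestamp"] - edit["prev_timestamp"])
    return {
        "comment_exists": bool(comment),
        "comment_length": len(comment),
        "log_successiveness": math.log(dt),
        "anonymous": edit.get("anonymous", True),
    }

f = context_features({"comment": "", "timestamp": 1000.0,
                      "prev_timestamp": 900.0, "anonymous": True})
```

An anonymous edit without a comment, made shortly after the previous one, is already suspicious before a single character of its content is inspected.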
91 The PAN Competition Continued 109 [ ] c
92 The PAN Competition Continued
1st International Competition on Wikipedia Vandalism Detection, PAN 2010
Facts:
- organized as a CLEF 2010 Lab
- 9 groups from 5 countries participated, 5 of the groups from the USA
- 15 weeks of training and testing (March to June)
- the corpus was newly created for the purpose of the competition
Task: Given a set of edits on Wikipedia articles, distinguish ill-intentioned edits from well-intentioned edits. 111 [ ] c
93 The PAN Competition Continued
Vandalism Corpus PAN-WVC-10: a large-scale resource for the controlled evaluation of detection algorithms:
- edits (sampled from a week's worth of Wikipedia edit logs)
- different edited articles (edit frequency resembles article importance)
- 2391 edits are vandalism (a 7% ratio is in concordance with the literature)
The edits in PAN-WVC-10 have been reviewed by 753 human annotators, recruited at Amazon's Mechanical Turk:
- Each edit was reviewed by at least 3 different annotators.
- If the annotators did not agree, the edit was reviewed again by 3 others.
- If still fewer than 2/3 of the annotators agreed, 3 more annotators were asked.
- After 8 iterations only 70 edits remained in a tie, which proved to be tough choices.
112 [ ] c
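The iterative review procedure above can be sketched as a loop that keeps requesting batches of three votes until one side holds a 2/3 majority. `get_votes` is a hypothetical stand-in for a Mechanical Turk request; the exact tie-handling of the real corpus construction may differ:

```python
def review_until_agreement(get_votes, max_rounds=8, majority=2/3):
    """Request 3 annotator votes per round (True = vandalism) until
    one label reaches a 2/3 majority of all votes so far, or until
    max_rounds is exhausted, in which case the edit is a tie."""
    votes = []
    for _ in range(max_rounds):
        votes.extend(get_votes(3))
        ratio = sum(votes) / len(votes)
        if ratio >= majority:
            return "vandalism", votes
        if (1 - ratio) >= majority:
            return "regular", votes
    return "tie", votes

# unanimous annotators settle the label in the first round
label, votes = review_until_agreement(lambda n: [True] * n)
```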
94 The PAN Competition Continued Vandalism Detection Results 113 [ ] c
95 The PAN Competition Continued Vandalism Detection Results 117 [ ] c
96 The PAN Competition Continued Vandalism Detection Results 122 [ ] c
97 The PAN Competition Continued Vandalism Detection Results 123 [ ] c
More informationInformation technology Learning, education and training Collaborative technology Collaborative learning communication. Part 1:
Provläsningsexemplar / Preview INTERNATIONAL STANDARD ISO/IEC 19780-1 Second edition 2015-10-15 Information technology Learning, education and training Collaborative technology Collaborative learning communication
More informationKnowledge Engineering with Semantic Web Technologies
This file is licensed under the Creative Commons Attribution-NonCommercial 3.0 (CC BY-NC 3.0) Knowledge Engineering with Semantic Web Technologies Lecture 5: Ontological Engineering 5.3 Ontology Learning
More informationEDITING & PROOFREADING CHECKLIST
EDITING & PROOFREADING CHECKLIST TABLE OF CONTENTS 1. Conduct a First Pass... 2 1.1. Ensure effective organization... 2 1.2. Check the flow and tone... 3 1.3. Check for correct mechanics... 4 1.4. Ensure
More informationCADIAL Search Engine at INEX
CADIAL Search Engine at INEX Jure Mijić 1, Marie-Francine Moens 2, and Bojana Dalbelo Bašić 1 1 Faculty of Electrical Engineering and Computing, University of Zagreb, Unska 3, 10000 Zagreb, Croatia {jure.mijic,bojana.dalbelo}@fer.hr
More informationKnowledge Retrieval. Franz J. Kurfess. Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A.
Knowledge Retrieval Franz J. Kurfess Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A. 1 Acknowledgements This lecture series has been sponsored by the European
More informationSYDE Winter 2011 Introduction to Pattern Recognition. Clustering
SYDE 372 - Winter 2011 Introduction to Pattern Recognition Clustering Alexander Wong Department of Systems Design Engineering University of Waterloo Outline 1 2 3 4 5 All the approaches we have learned
More informationRPI INSIDE DEEPQA INTRODUCTION QUESTION ANALYSIS 11/26/2013. Watson is. IBM Watson. Inside Watson RPI WATSON RPI WATSON ??? ??? ???
@ INSIDE DEEPQA Managing complex unstructured data with UIMA Simon Ellis INTRODUCTION 22 nd November, 2013 WAT SON TECHNOLOGIES AND OPEN ARCHIT ECT URE QUEST ION ANSWERING PROFESSOR JIM HENDLER S IMON
More informationCHAPTER 2: LITERATURE REVIEW
CHAPTER 2: LITERATURE REVIEW 2.1. LITERATURE REVIEW Wan-Yu Lin [31] introduced a framework that identifies online plagiarism by exploiting lexical, syntactic and semantic features that includes duplication-gram,
More informationTechnical Paper Style Guide
AACE International Technical Paper Style Guide Prepared by the AACE International Technical Board Revised February 3, 2017 Contents 1. Purpose... 3 2. General Requirements... 3 2.1. Authorship... 3 2.2.
More informationThe Application of Closed Frequent Subtrees to Authorship Attribution
The Application of Closed Frequent Subtrees to Authorship Attribution Jonas Lindh Morén January 24, 2014 Bachelor s Thesis in Computing Science, 15 credits Supervisor at CS-UmU: Johanna Björklund Examiner:
More informationGRADES LANGUAGE! Live, Grades Correlated to the Oklahoma College- and Career-Ready English Language Arts Standards
GRADES 4 10 LANGUAGE! Live, Grades 4 10 Correlated to the Oklahoma College- and Career-Ready English Language Arts Standards GRADE 4 Standard 1: Speaking and Listening Students will speak and listen effectively
More informationA Content Based Image Retrieval System Based on Color Features
A Content Based Image Retrieval System Based on Features Irena Valova, University of Rousse Angel Kanchev, Department of Computer Systems and Technologies, Rousse, Bulgaria, Irena@ecs.ru.acad.bg Boris
More informationBuilding and Annotating Corpora of Collaborative Authoring in Wikipedia
Building and Annotating Corpora of Collaborative Authoring in Wikipedia Johannes Daxenberger, Oliver Ferschke and Iryna Gurevych Workshop: Building Corpora of Computer-Mediated Communication: Issues, Challenges,
More informationAs well as being faster to read, plain English sounds friendlier and more in keeping with the tone people expect from digital communications.
PLAIN ENGLISH Use this plain English checklist to audit your writing Plain English explained It won t surprise you that complex text slows down comprehension. But it might surprise you to learn that the
More informationEnhanced retrieval using semantic technologies:
Enhanced retrieval using semantic technologies: Ontology based retrieval as a new search paradigm? - Considerations based on new projects at the Bavarian State Library Dr. Berthold Gillitzer 28. Mai 2008
More informationClustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York
Clustering Robert M. Haralick Computer Science, Graduate Center City University of New York Outline K-means 1 K-means 2 3 4 5 Clustering K-means The purpose of clustering is to determine the similarity
More informationMath 3820 Project. 1 Typeset or handwritten? Guidelines
Math 3820 Project Guidelines Abstract These are some recommendations concerning the projects in Math 3820. 1 Typeset or handwritten? Although handwritten reports will be accepted, I strongly recommended
More informationAssessing the Quality of Natural Language Text
Assessing the Quality of Natural Language Text DC Research Ulm (RIC/AM) daniel.sonntag@dfki.de GI 2004 Agenda Introduction and Background to Text Quality Text Quality Dimensions Intrinsic Text Quality,
More informationAn Overview of Plagiarism Recognition Techniques. Xie Ruoyun. Received June 2018; revised July 2018
International Journal of Knowledge www.ijklp.org and Language Processing KLP International c2018 ISSN 2191-2734 Volume 9, Number 2, 2018 pp.1-19 An Overview of Plagiarism Recognition Techniques Xie Ruoyun
More informationGovernment of Karnataka Department of Technical Education Bengaluru
CIE- 25 Marks Government of Karnataka Department of Technical Education Bengaluru Course Title: Basic Web Design Lab Scheme (L:T:P) : 0:2:4 Total Contact Hours: 78 Type of Course: Tutorial and Practical
More informationInformation Retrieval and Web Search
Information Retrieval and Web Search Introduction to IR models and methods Rada Mihalcea (Some of the slides in this slide set come from IR courses taught at UT Austin and Stanford) Information Retrieval
More informationNatural Language Processing SoSe Question Answering. (based on the slides of Dr. Saeedeh Momtazi)
Natural Language Processing SoSe 2015 Question Answering Dr. Mariana Neves July 6th, 2015 (based on the slides of Dr. Saeedeh Momtazi) Outline 2 Introduction History QA Architecture Outline 3 Introduction
More informationA MULTI-DIMENSIONAL DATA ORGANIZATION THAT ASSISTS IN THE PARSING AND PRODUCTION OF A SENTENCE
A MULTI-DIMENSIONAL DATA ORGANIZATION THAT ASSISTS IN THE PARSING AND PRODUCTION OF A SENTENCE W. Faris and K. Cheng Department of Computer Science University of Houston Houston, TX, 77204, USA http://www.cs.uh.edu
More informationInformation Retrieval CS Lecture 01. Razvan C. Bunescu School of Electrical Engineering and Computer Science
Information Retrieval CS 6900 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Information Retrieval Information Retrieval (IR) is finding material of an unstructured
More informationQuery Difficulty Prediction for Contextual Image Retrieval
Query Difficulty Prediction for Contextual Image Retrieval Xing Xing 1, Yi Zhang 1, and Mei Han 2 1 School of Engineering, UC Santa Cruz, Santa Cruz, CA 95064 2 Google Inc., Mountain View, CA 94043 Abstract.
More informationSemantically Driven Snippet Selection for Supporting Focused Web Searches
Semantically Driven Snippet Selection for Supporting Focused Web Searches IRAKLIS VARLAMIS Harokopio University of Athens Department of Informatics and Telematics, 89, Harokopou Street, 176 71, Athens,
More informationA Novel Categorized Search Strategy using Distributional Clustering Neenu Joseph. M 1, Sudheep Elayidom 2
A Novel Categorized Search Strategy using Distributional Clustering Neenu Joseph. M 1, Sudheep Elayidom 2 1 Student, M.E., (Computer science and Engineering) in M.G University, India, 2 Associate Professor
More informationME 4054W: SENIOR DESIGN PROJECTS
ME 4054W: SENIOR DESIGN PROJECTS Week 3 Thursday Documenting Your Design Before we get started We have received feedback from an industry advisor that some of the students on their design team were not
More informationWikipedia Vandalism Detection Through Machine Learning: Feature Review and New Proposals
Wikipedia Vandalism Detection Through Machine Learning: Feature Review and New Proposals Lab Report for PAN at CLEF 2010 Santiago M. Mola Velasco Private coldwind@coldwind.org Abstract Wikipedia is an
More informationDetailed Format Instructions for Authors of the SPB Encyclopedia
Detailed Format Instructions for Authors of the SPB Encyclopedia General Formatting: When preparing the manuscript, the author should limit the use of control characters or special formatting. Use italics
More informationNatural Language Processing. SoSe Question Answering
Natural Language Processing SoSe 2017 Question Answering Dr. Mariana Neves July 5th, 2017 Motivation Find small segments of text which answer users questions (http://start.csail.mit.edu/) 2 3 Motivation
More informationNatural Language Processing Is No Free Lunch
Natural Language Processing Is No Free Lunch STEFAN WAGNER UNIVERSITY OF STUTTGART, STUTTGART, GERMANY ntroduction o Impressive progress in NLP: OS with personal assistants like Siri or Cortan o Brief
More information10701 Machine Learning. Clustering
171 Machine Learning Clustering What is Clustering? Organizing data into clusters such that there is high intra-cluster similarity low inter-cluster similarity Informally, finding natural groupings among
More informationStudent Performance Analysis Report. Subtest / Content Strand. Verbal Reasoning 40 Questions Results by Question # of Q
Student: BLAIR, SIERRA Test Date: 9/2018 (Fall) Grade: 7 Level: 6 Page: 1 Verbal Reasoning 40 Questions Results by Question # of Q Analogical Reasoning (Results presented in order of question difficulty)
More information