Detection of Text Plagiarism and Wikipedia Vandalism

1 Detection of Text Plagiarism and Wikipedia Vandalism Benno Stein Bauhaus-Universität Weimar Keynote at SEPLN, Valencia, September 8, 2010

2 The webis Group Benno Stein Dennis Hoppe Maik Anderka Martin Potthast Matthias Hagen Nedim Lipka Peter Prettenhofer Tim Gollub Christin Gläser

5 Outline External Plagiarism Detection The PAN Competition Intrinsic Plagiarism Detection Vandalism Detection in Wikipedia The PAN Competition Continued

6 Plagiarism is the practice of claiming, or implying, original authorship of someone else's written or creative work, in whole or in part, or of incorporating it into one's own without adequate acknowledgment. [Wikipedia: Plagiarism, 2009]

7 ... better technology nowadays ;)

9 Is plagiarism a problem with respect to education? Is there a misunderstanding wrt. an evolving cultural technique? (a service that exploits the unacknowledged wisdom of the crowd.) Can plagiarism be detected by humans? Can plagiarism be detected by machines? Should automatic plagiarism detection algorithms become standard? For several reasons we should say "text reuse" rather than "plagiarism".

12 External Plagiarism Detection

16 External Plagiarism Detection How Humans Spot Plagiarism
+ Exploits human intuition for peculiar passages.
+ Exploits human experience to analyze the search engine results.
+ Is applied easily and in an ad-hoc manner.
- Cannot be done on a large scale.
- Depends on (commercial) third-party services.
- Fails in the case of obfuscated / modified text.
- Cannot be used to find the reuse of structure or argumentation lines.

17 External Plagiarism Detection Algorithms for Machines
Step 1: Keyword Extraction from Document
Step 2: Heuristic Search in the WWW
Step 3: Document Selection
Step 4: Detailed Analysis
Step 5: Knowledge-based Post-processing

19 External Plagiarism Detection Algorithms for Machines, Step 1: Keyword Extraction from Document
Where are the crucial keywords? Check for noun phrases. Find orthographic mistakes. Consider word frequency classes. But don't look in titles, captions, or headings.
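The word frequency class mentioned above is, in one common formulation, floor(log2(f(w*)/f(w))), where w* is the most frequent word in the corpus; rare words get high classes and make distinctive query keywords. A minimal sketch (the frequency table below is made up for illustration, not real corpus data):

```python
import math

def frequency_class(word, freq):
    """Word frequency class: floor(log2(f(w*)/f(w))), where w* is the
    most frequent word in the corpus. Rare words get high classes."""
    f_max = max(freq.values())
    return int(math.floor(math.log2(f_max / freq[word])))

# Toy frequency table (illustrative numbers only).
freq = {"the": 1000, "search": 125, "stylometry": 2}

print(frequency_class("the", freq))         # 0: the most frequent word
print(frequency_class("search", freq))      # 1000/125 = 8  -> class 3
print(frequency_class("stylometry", freq))  # 1000/2 = 500 -> class 8
```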

21 External Plagiarism Detection Algorithms for Machines, Step 2: Heuristic Search in the WWW
Query keywords: information retrieval, query formulation, search session, user support

25 External Plagiarism Detection Algorithms for Machines, Step 2: Heuristic Search in the WWW
The Query Cover Problem.

26 External Plagiarism Detection Algorithms for Machines, Step 2: The Query Cover Problem
Given: 1. A set W of keywords. 2. A query interface for a Web search engine S. 3. An upper bound k on the result list length ("User over Ranking").
Task: Find a family of queries Q covering W, each yielding at most k Web results.
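One way to attack the Query Cover Problem is a greedy heuristic: keep adding keywords from W to the current query until the engine's (estimated) result count drops to at most k, then start the next query. The sketch below is hypothetical; `result_count` is a made-up stand-in for the search engine's query interface:

```python
def greedy_query_cover(keywords, result_count, k):
    """Greedily build queries: add keywords to the current query until the
    engine reports at most k results, then start a new query. Returns a
    family of queries (keyword tuples) covering all of `keywords`."""
    queries, current = [], []
    for w in keywords:
        current.append(w)
        if result_count(tuple(current)) <= k:
            queries.append(tuple(current))
            current = []
    if current:
        queries.append(tuple(current))  # leftover keywords form a last query
    return queries

def result_count(query):
    # Hypothetical stand-in for the engine's hit-count estimate:
    # each added keyword cuts the count by a factor of 8.
    return 1_000_000 // (8 ** len(query))

print(greedy_query_cover(["a", "b", "c", "d", "e"], result_count, 1000))
# -> [('a', 'b', 'c', 'd'), ('e',)]
```

Under the toy count model, four keywords are needed before a query falls under k = 1000 results; the last keyword cannot be covered within the bound and forms a final query of its own.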

28 External Plagiarism Detection Algorithms for Machines, Step 4: Detailed Analysis
Technologies, from basic to complex:
Technology | What can be detected
MD5 hashing | Identity analysis for paragraphs
Hashed breakpoint chunking | Synchronized identity analysis for paragraphs
Fuzzy-fingerprinting | Tolerant similarity analysis for paragraphs
Dot plotting | Sequences of word n-grams

29 External Plagiarism Detection Algorithms for Machines: Pairwise Comparison
1. Partition each document into meaningful sections, also called chunks.
2. Do a pairwise comparison using a similarity function ϕ.
[Slide background: facsimile of the paper "On Web-based Plagiarism Analysis" (Kleppe, Braunsdorf, Loessnitz, Meyer zu Eissen; Bauhaus University Weimar), shown as a suspicious document next to the corpus documents.]
Complexity: n documents in corpus, c chunks per document on average: O(n · c²) comparisons
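The pairwise scheme above can be sketched directly; here the chunks are paragraphs and ϕ is Jaccard similarity over word sets, both of which are illustrative choices, not the specific ones used in the talk:

```python
def chunks(doc):
    """Chunking: split a document into paragraphs (one illustrative choice)."""
    return [p for p in doc.split("\n\n") if p.strip()]

def phi(c1, c2):
    """Similarity function: Jaccard overlap of the chunks' word sets."""
    w1, w2 = set(c1.lower().split()), set(c2.lower().split())
    return len(w1 & w2) / len(w1 | w2)

def pairwise_matches(suspicious, corpus, threshold=0.5):
    """Compare every suspicious chunk against every corpus chunk:
    O(n * c^2) similarity computations for n corpus documents with
    c chunks each."""
    matches = []
    for i, s in enumerate(chunks(suspicious)):
        for doc_id, doc in enumerate(corpus):
            for j, c in enumerate(chunks(doc)):
                if phi(s, c) >= threshold:
                    matches.append((i, doc_id, j))
    return matches

corpus = ["the quick brown fox\n\nan unrelated paragraph here"]
print(pairwise_matches("the quick brown fox", corpus))  # -> [(0, 0, 0)]
```

The triple loop makes the quadratic cost in the chunk count explicit, which is exactly what the hashing schemes on the next slides avoid.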

30 External Plagiarism Detection Algorithms for Machines: MD5 Hashing
1. Partition each document into equidistant sections.
2. Compute hash values of the chunks using a hash function h.
3. Put all hashes into a hash table. A collision indicates matching chunks.
[Slide background: the same paper facsimile, annotated with example hash values h=9154, h=2232.]
Complexity: n documents in corpus, c chunks per document on average: O(n · c) operations (fingerprint generation, hash table operations)
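A minimal sketch of this scheme, using Python's `hashlib.md5` over fixed-size (equidistant) word chunks; a hash-table bucket holding chunks from more than one document flags verbatim reuse:

```python
import hashlib
from collections import defaultdict

def equidistant_chunks(doc, size=5):
    """Partition a document into chunks of `size` words each."""
    words = doc.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def find_identical_chunks(docs, size=5):
    """Hash every chunk with MD5 and bucket by hash value; a bucket with
    chunks from more than one document indicates matching chunks."""
    table = defaultdict(list)
    for doc_id, doc in enumerate(docs):
        for chunk in equidistant_chunks(doc, size):
            h = hashlib.md5(chunk.encode("utf-8")).hexdigest()
            table[h].append((doc_id, chunk))
    return [entries for entries in table.values()
            if len({d for d, _ in entries}) > 1]

docs = ["one two three four five six seven eight nine ten",
        "one two three four five something entirely different here now"]
print(find_identical_chunks(docs))
```

Fingerprinting and table lookups are linear in the total number of chunks, which is the O(n · c) figure on the slide; the price is that a single changed word breaks the match, motivating the fuzzier schemes that follow.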

31 External Plagiarism Detection Algorithms for Machines: Hashed Breakpoint Chunking
1. Partition each document into synchronized sections.
2. Compute hash values of the chunks using a hash function h.
3. Put all hashes into a hash table. A collision indicates matching chunks.
[Slide background: the same paper facsimile, annotated with example hash values h=3294, h=7439.]
Complexity: n documents in corpus, c chunks per document on average: O(n · c) operations (fingerprint generation, hash table operations)
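The "synchronized" part works by placing chunk boundaries at words whose own hash satisfies a condition (e.g. hash mod p == 0), so boundaries depend on content rather than position and realign after an insertion. In the sketch below the "hash" is simply the word length, chosen so the example is easy to follow by hand; a real system would hash the word (e.g. CRC32 or MD5):

```python
def breakpoint_chunks(doc, p=2):
    """End a chunk after every word whose hash is divisible by p. Here the
    'hash' is just the word length (an illustrative toy); identical text
    yields identical chunks regardless of what precedes it, so chunking
    stays synchronized across edited copies."""
    chunks, current = [], []
    for word in doc.split():
        current.append(word)
        if len(word) % p == 0:
            chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks

a = "alpha beta gamma delta epsilon zeta"
b = "INSERTED alpha beta gamma delta epsilon zeta"
print(breakpoint_chunks(a))  # ['alpha beta', 'gamma delta epsilon zeta']
print(breakpoint_chunks(b))  # ['INSERTED', 'alpha beta', 'gamma delta epsilon zeta']
```

After the insertion only the chunk containing the new word differs; all later chunks match again, so hash-table collisions still find the reused text, unlike equidistant chunking where one insertion shifts every subsequent chunk.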

32 External Plagiarism Detection Algorithms for Machines: Fuzzy-Fingerprinting
Standard hashing: equal chunks yield the same hash key: h(c1) = h(c2) iff c1, c2 are equal. Problem: sensitive to the smallest changes.
Fuzzy-fingerprinting: different but similar chunks yield the same hash key: hϕ(c1) = hϕ(c2) implies c1, c2 are similar with high probability. Approach: abstraction by reducing the alphabet and neglecting word order. Problem: similarity-sensitive hash functions suffer from low recall.
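The abstraction idea (reduce the alphabet, neglect word order) can be sketched as a toy similarity-sensitive hash: reduce each word to its consonant skeleton, sort the skeletons, and hash the result, so many small rewordings leave the key unchanged. This is an illustrative simplification, not the published fuzzy-fingerprinting scheme:

```python
def fuzzy_hash(chunk):
    """Toy similarity-sensitive hash: drop vowels (alphabet reduction) and
    sort the word skeletons (neglect word order) before hashing, so many
    small edits and reorderings yield the same key."""
    skeletons = sorted({
        "".join(ch for ch in w if ch not in "aeiou")
        for w in chunk.lower().split()
    })
    return hash(" ".join(skeletons))

c1 = "the plagiarised passage was copied"
c2 = "was the plagiarised passage copied"   # reordered: same key
c3 = "a completely different sentence"
print(fuzzy_hash(c1) == fuzzy_hash(c2))  # True
print(fuzzy_hash(c1) == fuzzy_hash(c3))  # False
```

The recall problem on the slide shows up here too: a paraphrase that replaces words (rather than reordering them) changes the skeleton set and misses the bucket.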

33 External Plagiarism Detection Algorithms for Machines: Dot Plotting Geometric sequence analysis of all word 4-grams of two documents of interest.

36 External Plagiarism Detection Algorithms for Machines: Dot Plotting Level 1 (black): each dot indicates a common word 4-gram (hash collision).

37 External Plagiarism Detection Algorithms for Machines: Dot Plotting Level 2 (blue): neighboring common 4-grams are heuristically grouped.

38 External Plagiarism Detection Algorithms for Machines: Dot Plotting Level 3 (red): blue groups are merged by a cluster analysis (DBSCAN).

39 External Plagiarism Detection Algorithms for Machines: Dot Plotting The text involved in a cluster forms a plagiarism candidate.
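Level 1 of the dot plot (one dot per common word 4-gram) can be sketched as follows; levels 2 and 3 would then group and cluster the resulting dots:

```python
def word_ngrams(doc, n=4):
    """All word n-grams of a document, in order."""
    words = doc.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def dot_plot(doc_a, doc_b, n=4):
    """Level 1 of a dot plot: one dot (i, j) for every word n-gram that
    occurs at position i in doc_a and position j in doc_b."""
    grams_b = {}
    for j, g in enumerate(word_ngrams(doc_b, n)):
        grams_b.setdefault(g, []).append(j)
    return [(i, j)
            for i, g in enumerate(word_ngrams(doc_a, n))
            for j in grams_b.get(g, [])]

a = "one two three four five six UNIQUE"
b = "zero one two three four five six"
print(dot_plot(a, b))  # -> [(0, 1), (1, 2), (2, 3)]
```

The diagonal run of dots (0,1), (1,2), (2,3) is the visual signature of a contiguous reused passage, which the grouping and clustering levels merge into a single plagiarism candidate.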

41 External Plagiarism Detection Algorithms for Machines, Step 5: Knowledge-based Post-processing
Check for problematic decisions:
Citation analysis (can be problematic: consider an "excuse citation" in a footnote along with a completely reused text)
Comparison of authors and co-authors

42 External Plagiarism Detection Algorithms for Machines, Step 5: Knowledge-based Post-processing
How to overcome the language barrier:
Machine translation services
Mapping into a concept space (ESA, CL-ESA)

43 The PAN Competition

44 The PAN Competition 2nd International Competition on Plagiarism Detection, PAN 2010
Facts: organized as a CLEF 2010 Lab; 18 groups from 12 countries participated; 15 weeks of training and testing (March to June); the training corpus was the PAN-PC-09; the test corpus was the PAN-PC-10, a new version of last year's corpus; incidentally, the 1st competition was held at SEPLN'09.
Task: Given a set of suspicious documents and a set of source documents, find all plagiarized sections in the suspicious documents and, if available, the corresponding source sections.

45 The PAN Competition Plagiarism Corpus PAN-PC-10
Large-scale resource for the controlled evaluation of detection algorithms: documents obtained from books from Project Gutenberg; plagiarism cases (about 0-10 cases per document).
PAN-PC-10 addresses a broad range of plagiarism situations by varying reasonably within the following parameters: 1. document length, 2. document language, 3. detection task, 4. plagiarism case length, 5. plagiarism case obfuscation, 6. plagiarism case topic alignment.

46 The PAN Competition PAN-PC-10 Document Statistics
Document length: 50% short (1-10 pages), 35% medium, 15% long.
Document language: 80% English, 10% German, 10% Spanish.
Detection task: 70% external analysis, 30% intrinsic analysis.
[Chart: plagiarism fraction per document, split into plagiarized vs. unmodified text.]

47 The PAN Competition PAN-PC-10 Plagiarism Case Statistics
Plagiarism case length: 34% short, 33% medium, 33% long.
Plagiarism case obfuscation: 40% none, 40% artificial, 6% simulated (AMT), 14% cross-language (de, es); ranging from low to high obfuscation.
Artificial plagiarism: algorithmic obfuscation. Simulated plagiarism: obfuscation via Amazon Mechanical Turk. Cross-language plagiarism: obfuscation due to machine translation de-en and es-en.
Plagiarism case topic alignment: 50% intra-topic, 50% inter-topic.

48 The PAN Competition Plagiarism Detection Results
Participants, ranked by plagdet: Kasprzak, Zou, Muhr, Grozea, Oberreuter, Torrejón, Pereira, Palkovskii, Sobha, Gottron, Micol, Costa-jussà, Nawab, Gupta, Vania, Suàrez, Alzahrani, Iftene.
Plagdet combines precision, recall, and granularity. Precision and recall are well-known, yet not well-defined for plagiarism detection. Granularity measures the number of times a single plagiarism case has been detected. [Potthast et al., 2010]
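Per Potthast et al. (2010), plagdet combines the F-measure of precision and recall with granularity as plagdet = F1 / log2(1 + granularity), so a detector that reports the same case many times is penalized. A sketch of this final combination step (the character-level precision/recall computation over case overlaps is omitted):

```python
import math

def plagdet(precision, recall, granularity):
    """plagdet score of Potthast et al. (2010): harmonic mean of precision
    and recall, discounted by the log of the granularity. Granularity >= 1;
    a value of 1 means each plagiarism case is detected exactly once."""
    if precision + recall == 0:
        return 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return f1 / math.log2(1 + granularity)

print(round(plagdet(0.9, 0.8, 1.0), 3))  # 0.847: perfect granularity, plain F1
print(round(plagdet(0.9, 0.8, 2.0), 3))  # 0.534: duplicate detections discounted
```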

49 The PAN Competition Plagiarism Detection Results
[Chart: recall, precision, and granularity per participant: Kasprzak, Zou, Muhr, Grozea, Oberreuter, Torrejón, Pereira, Palkovskii, Sobha, Gottron, Micol, Costa-jussà, Nawab, Gupta, Vania, Suàrez, Alzahrani, Iftene.]

50 Intrinsic Plagiarism Detection

52 Intrinsic Plagiarism Detection How Humans Spot Plagiarism
When no corpus along with a powerful search engine is at hand...
look for style changes
check for peculiarities (orthographic mistakes, typographical habits)
listen to the instincts (perhaps the most powerful "technology")

53 Intrinsic Plagiarism Detection Algorithms for Machines
Step 1: Impurity Assessment
Step 2: Chunking Strategy
Step 3: Style Model Construction
Step 4: Outlier Identification
Step 5: Outlier Post-processing

55 Intrinsic Plagiarism Detection Algorithms for Machines, Step 1: Impurity Assessment
How large is the fraction θ of plagiarized text?
document length analysis
genre analysis (e.g. scientific article versus editorial)
analysis of issuing institution

57 Intrinsic Plagiarism Detection Algorithms for Machines, Step 2: Chunking Strategy
How to find text positions where plagiarism starts or ends? From basic to complex:
uniform length chunking (simple but naive)
structural boundaries (chapters, paragraphs, tables, captions)
topical boundaries (difficult but powerful)
stylistic boundaries (best, but usually intractable)
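The basic end of the spectrum above can be sketched in a few lines; both functions are naive illustrations of the first two strategies, not the talk's implementation:

```python
def uniform_chunks(text, size=40):
    """Naive uniform-length chunking: a fixed number of words per chunk."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def structural_chunks(text):
    """Structural chunking: cut at paragraph boundaries instead of at an
    arbitrary word count, so a style change is less likely to be split
    across two chunks."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

text = ("First paragraph in the author's own style.\n\n"
        "Second paragraph, possibly lifted from elsewhere.")
print(structural_chunks(text))      # two chunks, one per paragraph
print(uniform_chunks(text, size=5))  # three chunks, boundaries mid-sentence
```

Uniform chunking ignores the document's structure entirely, which is exactly why the slide calls it naive: a plagiarized paragraph can be smeared across several chunks and diluted below detection thresholds.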

59 Intrinsic Plagiarism Detection Algorithms for Machines, Step 3: Style Model Construction
The question of stylometry: How to quantify writing style? From basic to complex:
structural features (paragraph lengths, use of tables, signatures)
character-based lexical features (n-gram frequency, compression rate)
word-based lexical features (readability, writing complexity)
syntactic features (part of speech, function words)
dialectic power and argumentation

60 Intrinsic Plagiarism Detection Algorithms for Machines: Style Features that Work
Top-ranked stylometric features: Flesch Reading Ease Score; average number of syllables per word; frequency of the term "of"; Noun-Verb-Noun tri-gram; Noun-Noun-Verb tri-gram; Verb-Noun-Noun tri-gram; Gunning Fog index; Yule's K measure; Flesch-Kincaid grade level; average word length; Noun-Preposition-ProperNoun tri-gram; Honore's R measure; average word frequency class; Consonant-Vowel-Consonant tri-gram; frequency of the term "is"; Noun-Noun-CoordinatingConjunction tri-gram; NounPlural-Preposition-Determiner tri-gram; Determiner-NounPlural-Preposition tri-gram; Consonant-Vowel-Vowel tri-gram.
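The top feature above, the Flesch Reading Ease Score, is 206.835 - 1.015 * (words/sentences) - 84.6 * (syllables/words). The syllable counter below is a rough vowel-group heuristic, good enough for a sketch but not a proper syllabifier:

```python
import re

def count_syllables(word):
    """Rough heuristic: count groups of consecutive vowels (incl. y)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    """Flesch Reading Ease:
    206.835 - 1.015*(words/sentences) - 84.6*(syllables/words).
    Higher scores mean easier-to-read text."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / sentences)
            - 84.6 * (syllables / len(words)))

simple = "The cat sat. The dog ran."
dense = "Multisyllabic terminological obfuscation complicates comprehension."
print(flesch_reading_ease(simple) > flesch_reading_ease(dense))  # True
```

Computed per chunk, a sudden jump in this score between adjacent chunks is precisely the kind of style break the outlier-identification step looks for.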

62 Intrinsic Plagiarism Detection Algorithms for Machines. Step 1: Impurity Assessment, Step 2: Chunking Strategy, Step 3: Style Model Construction, Step 4: Outlier Identification, Step 5: Outlier Post-processing.
One-class classification: we have an idea about positive examples only. :(
- Density methods try to model a style feature's distribution.
- Boundary methods try to cluster text portions of similar style.
- Reconstruction methods quantify the style generation error under the average style model.
75 [ ] c
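A minimal density-method sketch, assuming one style feature value per chunk and a Gaussian fit; the z-score threshold is an illustrative assumption:

```python
import statistics

def density_outliers(feature_values, z_threshold=2.0):
    """Density method: model a style feature's distribution over all
    chunks as a Gaussian and flag chunks whose feature value lies
    improbably far from the mean."""
    mu = statistics.mean(feature_values)
    sigma = statistics.stdev(feature_values)  # sample standard deviation
    return [i for i, value in enumerate(feature_values)
            if sigma > 0 and abs(value - mu) / sigma > z_threshold]
```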

63 Intrinsic Plagiarism Detection Algorithms for Machines: Outlier Identification
[Screenshot of an example suspicious document, annotated with the style vectors of its chunks: Alexander Kleppe, Dennis Braunsdorf, Christoph Loessnitz, Sven Meyer zu Eissen (Bauhaus University Weimar, Faculty of Media). Abstract: "The paper in hand presents a Web-based application for the analysis of text documents with respect to plagiarism. ... Since well-known algorithms for plagiarism detection assume the existence of a candidate document collection against which a suspicious document can be compared, they are unsuited to spot potentially copied passages using only the input document. ... Our method is able to detect a change in writing style, and consequently to identify suspicious passages within a single document."]
80 [ ] c

64 Intrinsic Plagiarism Detection Algorithms for Machines: Outlier Identification
[Same screenshot, with a style feature distribution overlaid on the style vectors of the chunks.]
Assume that style features of outliers (plagiarized text) are uniformly distributed.
81 [ ] c

65 Intrinsic Plagiarism Detection Algorithms for Machines: Outlier Identification
[Same screenshot, with a style feature distribution overlaid on the style vectors of the chunks.]
Assume that style features of original text are Gaussian distributed.
82 [ ] c

66 Intrinsic Plagiarism Detection Algorithms for Machines: Outlier Identification
[Same screenshot; the style feature axis is partitioned into decision regions H1, uncertainty interval, H0, uncertainty interval, H1.]
Compute the maximum a-posteriori hypothesis under Naive Bayes.
83 [ ] c

67 Intrinsic Plagiarism Detection Algorithms for Machines. Step 1: Impurity Assessment, Step 2: Chunking Strategy, Step 3: Style Model Construction, Step 4: Outlier Identification, Step 5: Outlier Post-processing. 84 [ ] c

68 Intrinsic Plagiarism Detection Algorithms for Machines. Step 1: Impurity Assessment, Step 2: Chunking Strategy, Step 3: Style Model Construction, Step 4: Outlier Identification, Step 5: Outlier Post-processing.
Since we are still unsure... how can we obtain additional evidence about authorship?
- Raise precision at the expense of recall (analyze the ROC characteristic).
- If sufficient text is available, apply authorship verification technology (unmasking).
85 [ ] c

69 Vandalism Detection in Wikipedia 86 [ ] c

70 Vandalism Detection in Wikipedia Example: size, capitalization, punctuation, word existence 87 [ ] c

71 Vandalism Detection in Wikipedia Example: size, capitalization, punctuation, word existence 88 [ ] c

72 Vandalism Detection in Wikipedia Example: size, capitalization, punctuation, repetition 89 [ ] c

73 Vandalism Detection in Wikipedia Example: size, capitalization, punctuation, repetition 90 [ ] c

74 Vandalism Detection in Wikipedia Example: vulgarism, sentiment 91 [ ] c

75 Vandalism Detection in Wikipedia Example: vulgarism, sentiment 92 [ ] c

76 Vandalism Detection in Wikipedia Example: special chars, spacing 93 [ ] c

77 Vandalism Detection in Wikipedia Example: special chars, spacing 94 [ ] c

78 Vandalism Detection in Wikipedia Example: misguided helping 95 [ ] c

79 Vandalism Detection in Wikipedia Example: misguided helping 96 [ ] c

80 Vandalism Detection in Wikipedia Example: wrong facts, defamation 97 [ ] c

81 Vandalism Detection in Wikipedia Example: wrong facts, defamation 98 [ ] c

82 Vandalism Detection in Wikipedia Example: opinionated 99 [ ] c

83 Vandalism Detection in Wikipedia Example: opinionated 100 [ ] c

84 Vandalism Detection in Wikipedia Example: wrong facts, nonsense 101 [ ] c

85 Vandalism Detection in Wikipedia Example: wrong facts, nonsense 102 [ ] c

86 Vandalism Detection in Wikipedia Example: wrong facts, article history may suggest otherwise 103 [ ] c

87 Vandalism Detection in Wikipedia: The Machine Learning Perspective
The achievements of machine learning unfold their full power in discrimination situations. The tasks of intrinsic plagiarism analysis, authorship verification, and vandalism detection share a particular characteristic: they are one-class classification problems. Feature engineering plays an outstanding role.
105 [ ] c

88 Vandalism Detection in Wikipedia Two Types of Edit Features 106 [ ] c

89 Vandalism Detection in Wikipedia. Two Types of Edit Features: Content-based

Character-level features:
- Capitalization: ratio of upper-case chars to lower-case chars (all chars)
- Distribution: Kullback-Leibler divergence of the char distribution from the expectation
- Compressibility: compression rate of the edit differences
- Markup: ratio of new (changed) wikitext chars to all wikitext chars

Word-level features:
- Vulgarism: frequency of vulgar words
- Pronouns: frequency of personal pronouns
- Sentiment: frequency of sentiment words

Spelling and grammar features:
- Word existence: ratio of words that occur in an English dictionary
- Spelling: frequency (impact) of spelling errors
- Grammar: number of grammatical errors

Edit type features:
- Edit type: the edit is an insertion, deletion, modification, or a combination
- Replacement: the article (a paragraph) is completely replaced, excluding its title
107 [ ] c
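Two of the character-level features might be computed as follows (a sketch; the +1 smoothing in the ratio is an assumption):

```python
import zlib

def capitalization_ratio(text):
    """Capitalization feature: ratio of upper-case to lower-case
    characters in the inserted text."""
    upper = sum(ch.isupper() for ch in text)
    lower = sum(ch.islower() for ch in text)
    return upper / (lower + 1)  # +1 avoids division by zero

def compressibility(text):
    """Compressibility feature: compression rate of the edit
    differences. Repetitive vandalism ('hahahaha...') compresses far
    better than regular prose."""
    raw = text.encode("utf-8")
    return len(zlib.compress(raw)) / max(len(raw), 1)
```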

90 Vandalism Detection in Wikipedia. Two Types of Edit Features: Context-based

Edit comment features:
- Existence: a comment was given
- Length: length of the comment

Edit time features:
- Edit time: hour of the day the edit was made
- Successiveness: logarithm of the time difference to the previous edit

Article revision history features:
- Revisions: number of revisions
- Regular: number of regular edits

Article trustworthiness features:
- Suspect topic: the article is on the list of often-vandalized articles
- WikiTrust: values from the WikiTrust trust histogram

Editor reputation features:
- Anonymous: anonymous editor
- Reputation: scores that compute a user's reputation based on previous edits
- Registration: time the editor was registered with Wikipedia
108 [ ] c
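The Successiveness feature, for instance, is just the logarithm of the time gap to the previous edit; the +1 guard against log(0) is an assumption:

```python
import math

def successiveness(edit_time, previous_edit_time):
    """Successiveness feature: logarithm of the time difference (in
    seconds) to the previous edit. Rapid-fire edits yield small
    values."""
    gap = max(edit_time - previous_edit_time, 0)
    return math.log(gap + 1)  # +1 avoids log(0) for simultaneous edits
```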

91 The PAN Competition Continued 109 [ ] c

92 The PAN Competition Continued
1st International Competition on Wikipedia Vandalism Detection, PAN 2010
Facts:
- organized as a CLEF 2010 Lab
- 9 groups from 5 countries participated, 5 of them from the USA
- 15 weeks of training and testing (March to June)
- the corpus was newly created for the purpose of the competition
Task: Given a set of edits on Wikipedia articles, distinguish ill-intentioned edits from well-intentioned edits.
111 [ ] c

93 The PAN Competition Continued: Vandalism Corpus PAN-WVC-10
A large-scale resource for the controlled evaluation of detection algorithms:
- edits (sampled from a week's worth of Wikipedia edit logs)
- different edited articles (edit frequency resembles article importance)
- 2391 edits are vandalism (a 7% ratio, in concordance with the literature)
The edits in PAN-WVC-10 have been reviewed by 753 human annotators, recruited at Amazon's Mechanical Turk:
- Each edit was reviewed by at least 3 different annotators.
- If the annotators did not agree, the edit was reviewed again by 3 others.
- If still less than 2/3 of the annotators agreed, 3 more annotators were asked.
- After 8 iterations only 70 edits remained in a tie, which proved to be tough choices.
112 [ ] c
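The review procedure above amounts to the following loop; `get_votes(n)` is a hypothetical callback that recruits n annotators and returns their labels:

```python
def annotate_until_agreement(get_votes, max_rounds=8):
    """Sketch of the PAN-WVC-10 review loop: 3 annotators vote; on
    disagreement, rounds of 3 further annotators are added until at
    least 2/3 of all votes agree or 8 rounds have passed."""
    votes = list(get_votes(3))
    if len(set(votes)) == 1:  # the initial annotators agree
        return votes[0]
    for _ in range(max_rounds - 1):
        votes.extend(get_votes(3))
        majority = max(set(votes), key=votes.count)
        if votes.count(majority) / len(votes) >= 2 / 3:
            return majority
    return None  # still a tie: one of the tough choices
```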

94 The PAN Competition Continued Vandalism Detection Results 113 [ ] c

95 The PAN Competition Continued Vandalism Detection Results 117 [ ] c

96 The PAN Competition Continued Vandalism Detection Results 122 [ ] c

97 The PAN Competition Continued Vandalism Detection Results 123 [ ] c


More information

What is this Song About?: Identification of Keywords in Bollywood Lyrics

What is this Song About?: Identification of Keywords in Bollywood Lyrics What is this Song About?: Identification of Keywords in Bollywood Lyrics by Drushti Apoorva G, Kritik Mathur, Priyansh Agrawal, Radhika Mamidi in 19th International Conference on Computational Linguistics

More information

Chapter ML:XI (continued)

Chapter ML:XI (continued) Chapter ML:XI (continued) XI. Cluster Analysis Data Mining Overview Cluster Analysis Basics Hierarchical Cluster Analysis Iterative Cluster Analysis Density-Based Cluster Analysis Cluster Evaluation Constrained

More information

Information Retrieval. Chap 7. Text Operations

Information Retrieval. Chap 7. Text Operations Information Retrieval Chap 7. Text Operations The Retrieval Process user need User Interface 4, 10 Text Text logical view Text Operations logical view 6, 7 user feedback Query Operations query Indexing

More information

Chapter DM:II. II. Cluster Analysis

Chapter DM:II. II. Cluster Analysis Chapter DM:II II. Cluster Analysis Cluster Analysis Basics Hierarchical Cluster Analysis Iterative Cluster Analysis Density-Based Cluster Analysis Cluster Evaluation Constrained Cluster Analysis DM:II-1

More information

Introduction to Text Mining. Hongning Wang

Introduction to Text Mining. Hongning Wang Introduction to Text Mining Hongning Wang CS@UVa Who Am I? Hongning Wang Assistant professor in CS@UVa since August 2014 Research areas Information retrieval Data mining Machine learning CS@UVa CS6501:

More information

CHAPTER THREE INFORMATION RETRIEVAL SYSTEM

CHAPTER THREE INFORMATION RETRIEVAL SYSTEM CHAPTER THREE INFORMATION RETRIEVAL SYSTEM 3.1 INTRODUCTION Search engine is one of the most effective and prominent method to find information online. It has become an essential part of life for almost

More information

CS 6320 Natural Language Processing

CS 6320 Natural Language Processing CS 6320 Natural Language Processing Information Retrieval Yang Liu Slides modified from Ray Mooney s (http://www.cs.utexas.edu/users/mooney/ir-course/slides/) 1 Introduction of IR System components, basic

More information

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. Springer

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. Springer Bing Liu Web Data Mining Exploring Hyperlinks, Contents, and Usage Data With 177 Figures Springer Table of Contents 1. Introduction 1 1.1. What is the World Wide Web? 1 1.2. A Brief History of the Web

More information

Part I: Data Mining Foundations

Part I: Data Mining Foundations Table of Contents 1. Introduction 1 1.1. What is the World Wide Web? 1 1.2. A Brief History of the Web and the Internet 2 1.3. Web Data Mining 4 1.3.1. What is Data Mining? 6 1.3.2. What is Web Mining?

More information

ResPubliQA 2010

ResPubliQA 2010 SZTAKI @ ResPubliQA 2010 David Mark Nemeskey Computer and Automation Research Institute, Hungarian Academy of Sciences, Budapest, Hungary (SZTAKI) Abstract. This paper summarizes the results of our first

More information

CLEF-IP 2009: Exploring Standard IR Techniques on Patent Retrieval

CLEF-IP 2009: Exploring Standard IR Techniques on Patent Retrieval DCU @ CLEF-IP 2009: Exploring Standard IR Techniques on Patent Retrieval Walid Magdy, Johannes Leveling, Gareth J.F. Jones Centre for Next Generation Localization School of Computing Dublin City University,

More information

University of Amsterdam at INEX 2010: Ad hoc and Book Tracks

University of Amsterdam at INEX 2010: Ad hoc and Book Tracks University of Amsterdam at INEX 2010: Ad hoc and Book Tracks Jaap Kamps 1,2 and Marijn Koolen 1 1 Archives and Information Studies, Faculty of Humanities, University of Amsterdam 2 ISLA, Faculty of Science,

More information

Information technology Learning, education and training Collaborative technology Collaborative learning communication. Part 1:

Information technology Learning, education and training Collaborative technology Collaborative learning communication. Part 1: Provläsningsexemplar / Preview INTERNATIONAL STANDARD ISO/IEC 19780-1 Second edition 2015-10-15 Information technology Learning, education and training Collaborative technology Collaborative learning communication

More information

Knowledge Engineering with Semantic Web Technologies

Knowledge Engineering with Semantic Web Technologies This file is licensed under the Creative Commons Attribution-NonCommercial 3.0 (CC BY-NC 3.0) Knowledge Engineering with Semantic Web Technologies Lecture 5: Ontological Engineering 5.3 Ontology Learning

More information

EDITING & PROOFREADING CHECKLIST

EDITING & PROOFREADING CHECKLIST EDITING & PROOFREADING CHECKLIST TABLE OF CONTENTS 1. Conduct a First Pass... 2 1.1. Ensure effective organization... 2 1.2. Check the flow and tone... 3 1.3. Check for correct mechanics... 4 1.4. Ensure

More information

CADIAL Search Engine at INEX

CADIAL Search Engine at INEX CADIAL Search Engine at INEX Jure Mijić 1, Marie-Francine Moens 2, and Bojana Dalbelo Bašić 1 1 Faculty of Electrical Engineering and Computing, University of Zagreb, Unska 3, 10000 Zagreb, Croatia {jure.mijic,bojana.dalbelo}@fer.hr

More information

Knowledge Retrieval. Franz J. Kurfess. Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A.

Knowledge Retrieval. Franz J. Kurfess. Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A. Knowledge Retrieval Franz J. Kurfess Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A. 1 Acknowledgements This lecture series has been sponsored by the European

More information

SYDE Winter 2011 Introduction to Pattern Recognition. Clustering

SYDE Winter 2011 Introduction to Pattern Recognition. Clustering SYDE 372 - Winter 2011 Introduction to Pattern Recognition Clustering Alexander Wong Department of Systems Design Engineering University of Waterloo Outline 1 2 3 4 5 All the approaches we have learned

More information

RPI INSIDE DEEPQA INTRODUCTION QUESTION ANALYSIS 11/26/2013. Watson is. IBM Watson. Inside Watson RPI WATSON RPI WATSON ??? ??? ???

RPI INSIDE DEEPQA INTRODUCTION QUESTION ANALYSIS 11/26/2013. Watson is. IBM Watson. Inside Watson RPI WATSON RPI WATSON ??? ??? ??? @ INSIDE DEEPQA Managing complex unstructured data with UIMA Simon Ellis INTRODUCTION 22 nd November, 2013 WAT SON TECHNOLOGIES AND OPEN ARCHIT ECT URE QUEST ION ANSWERING PROFESSOR JIM HENDLER S IMON

More information

CHAPTER 2: LITERATURE REVIEW

CHAPTER 2: LITERATURE REVIEW CHAPTER 2: LITERATURE REVIEW 2.1. LITERATURE REVIEW Wan-Yu Lin [31] introduced a framework that identifies online plagiarism by exploiting lexical, syntactic and semantic features that includes duplication-gram,

More information

Technical Paper Style Guide

Technical Paper Style Guide AACE International Technical Paper Style Guide Prepared by the AACE International Technical Board Revised February 3, 2017 Contents 1. Purpose... 3 2. General Requirements... 3 2.1. Authorship... 3 2.2.

More information

The Application of Closed Frequent Subtrees to Authorship Attribution

The Application of Closed Frequent Subtrees to Authorship Attribution The Application of Closed Frequent Subtrees to Authorship Attribution Jonas Lindh Morén January 24, 2014 Bachelor s Thesis in Computing Science, 15 credits Supervisor at CS-UmU: Johanna Björklund Examiner:

More information

GRADES LANGUAGE! Live, Grades Correlated to the Oklahoma College- and Career-Ready English Language Arts Standards

GRADES LANGUAGE! Live, Grades Correlated to the Oklahoma College- and Career-Ready English Language Arts Standards GRADES 4 10 LANGUAGE! Live, Grades 4 10 Correlated to the Oklahoma College- and Career-Ready English Language Arts Standards GRADE 4 Standard 1: Speaking and Listening Students will speak and listen effectively

More information

A Content Based Image Retrieval System Based on Color Features

A Content Based Image Retrieval System Based on Color Features A Content Based Image Retrieval System Based on Features Irena Valova, University of Rousse Angel Kanchev, Department of Computer Systems and Technologies, Rousse, Bulgaria, Irena@ecs.ru.acad.bg Boris

More information

Building and Annotating Corpora of Collaborative Authoring in Wikipedia

Building and Annotating Corpora of Collaborative Authoring in Wikipedia Building and Annotating Corpora of Collaborative Authoring in Wikipedia Johannes Daxenberger, Oliver Ferschke and Iryna Gurevych Workshop: Building Corpora of Computer-Mediated Communication: Issues, Challenges,

More information

As well as being faster to read, plain English sounds friendlier and more in keeping with the tone people expect from digital communications.

As well as being faster to read, plain English sounds friendlier and more in keeping with the tone people expect from digital communications. PLAIN ENGLISH Use this plain English checklist to audit your writing Plain English explained It won t surprise you that complex text slows down comprehension. But it might surprise you to learn that the

More information

Enhanced retrieval using semantic technologies:

Enhanced retrieval using semantic technologies: Enhanced retrieval using semantic technologies: Ontology based retrieval as a new search paradigm? - Considerations based on new projects at the Bavarian State Library Dr. Berthold Gillitzer 28. Mai 2008

More information

Clustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York

Clustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York Clustering Robert M. Haralick Computer Science, Graduate Center City University of New York Outline K-means 1 K-means 2 3 4 5 Clustering K-means The purpose of clustering is to determine the similarity

More information

Math 3820 Project. 1 Typeset or handwritten? Guidelines

Math 3820 Project. 1 Typeset or handwritten? Guidelines Math 3820 Project Guidelines Abstract These are some recommendations concerning the projects in Math 3820. 1 Typeset or handwritten? Although handwritten reports will be accepted, I strongly recommended

More information

Assessing the Quality of Natural Language Text

Assessing the Quality of Natural Language Text Assessing the Quality of Natural Language Text DC Research Ulm (RIC/AM) daniel.sonntag@dfki.de GI 2004 Agenda Introduction and Background to Text Quality Text Quality Dimensions Intrinsic Text Quality,

More information

An Overview of Plagiarism Recognition Techniques. Xie Ruoyun. Received June 2018; revised July 2018

An Overview of Plagiarism Recognition Techniques. Xie Ruoyun. Received June 2018; revised July 2018 International Journal of Knowledge www.ijklp.org and Language Processing KLP International c2018 ISSN 2191-2734 Volume 9, Number 2, 2018 pp.1-19 An Overview of Plagiarism Recognition Techniques Xie Ruoyun

More information

Government of Karnataka Department of Technical Education Bengaluru

Government of Karnataka Department of Technical Education Bengaluru CIE- 25 Marks Government of Karnataka Department of Technical Education Bengaluru Course Title: Basic Web Design Lab Scheme (L:T:P) : 0:2:4 Total Contact Hours: 78 Type of Course: Tutorial and Practical

More information

Information Retrieval and Web Search

Information Retrieval and Web Search Information Retrieval and Web Search Introduction to IR models and methods Rada Mihalcea (Some of the slides in this slide set come from IR courses taught at UT Austin and Stanford) Information Retrieval

More information

Natural Language Processing SoSe Question Answering. (based on the slides of Dr. Saeedeh Momtazi)

Natural Language Processing SoSe Question Answering. (based on the slides of Dr. Saeedeh Momtazi) Natural Language Processing SoSe 2015 Question Answering Dr. Mariana Neves July 6th, 2015 (based on the slides of Dr. Saeedeh Momtazi) Outline 2 Introduction History QA Architecture Outline 3 Introduction

More information

A MULTI-DIMENSIONAL DATA ORGANIZATION THAT ASSISTS IN THE PARSING AND PRODUCTION OF A SENTENCE

A MULTI-DIMENSIONAL DATA ORGANIZATION THAT ASSISTS IN THE PARSING AND PRODUCTION OF A SENTENCE A MULTI-DIMENSIONAL DATA ORGANIZATION THAT ASSISTS IN THE PARSING AND PRODUCTION OF A SENTENCE W. Faris and K. Cheng Department of Computer Science University of Houston Houston, TX, 77204, USA http://www.cs.uh.edu

More information

Information Retrieval CS Lecture 01. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Information Retrieval CS Lecture 01. Razvan C. Bunescu School of Electrical Engineering and Computer Science Information Retrieval CS 6900 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Information Retrieval Information Retrieval (IR) is finding material of an unstructured

More information

Query Difficulty Prediction for Contextual Image Retrieval

Query Difficulty Prediction for Contextual Image Retrieval Query Difficulty Prediction for Contextual Image Retrieval Xing Xing 1, Yi Zhang 1, and Mei Han 2 1 School of Engineering, UC Santa Cruz, Santa Cruz, CA 95064 2 Google Inc., Mountain View, CA 94043 Abstract.

More information

Semantically Driven Snippet Selection for Supporting Focused Web Searches

Semantically Driven Snippet Selection for Supporting Focused Web Searches Semantically Driven Snippet Selection for Supporting Focused Web Searches IRAKLIS VARLAMIS Harokopio University of Athens Department of Informatics and Telematics, 89, Harokopou Street, 176 71, Athens,

More information

A Novel Categorized Search Strategy using Distributional Clustering Neenu Joseph. M 1, Sudheep Elayidom 2

A Novel Categorized Search Strategy using Distributional Clustering Neenu Joseph. M 1, Sudheep Elayidom 2 A Novel Categorized Search Strategy using Distributional Clustering Neenu Joseph. M 1, Sudheep Elayidom 2 1 Student, M.E., (Computer science and Engineering) in M.G University, India, 2 Associate Professor

More information

ME 4054W: SENIOR DESIGN PROJECTS

ME 4054W: SENIOR DESIGN PROJECTS ME 4054W: SENIOR DESIGN PROJECTS Week 3 Thursday Documenting Your Design Before we get started We have received feedback from an industry advisor that some of the students on their design team were not

More information

Wikipedia Vandalism Detection Through Machine Learning: Feature Review and New Proposals

Wikipedia Vandalism Detection Through Machine Learning: Feature Review and New Proposals Wikipedia Vandalism Detection Through Machine Learning: Feature Review and New Proposals Lab Report for PAN at CLEF 2010 Santiago M. Mola Velasco Private coldwind@coldwind.org Abstract Wikipedia is an

More information

Detailed Format Instructions for Authors of the SPB Encyclopedia

Detailed Format Instructions for Authors of the SPB Encyclopedia Detailed Format Instructions for Authors of the SPB Encyclopedia General Formatting: When preparing the manuscript, the author should limit the use of control characters or special formatting. Use italics

More information

Natural Language Processing. SoSe Question Answering

Natural Language Processing. SoSe Question Answering Natural Language Processing SoSe 2017 Question Answering Dr. Mariana Neves July 5th, 2017 Motivation Find small segments of text which answer users questions (http://start.csail.mit.edu/) 2 3 Motivation

More information

Natural Language Processing Is No Free Lunch

Natural Language Processing Is No Free Lunch Natural Language Processing Is No Free Lunch STEFAN WAGNER UNIVERSITY OF STUTTGART, STUTTGART, GERMANY ntroduction o Impressive progress in NLP: OS with personal assistants like Siri or Cortan o Brief

More information

10701 Machine Learning. Clustering

10701 Machine Learning. Clustering 171 Machine Learning Clustering What is Clustering? Organizing data into clusters such that there is high intra-cluster similarity low inter-cluster similarity Informally, finding natural groupings among

More information

Student Performance Analysis Report. Subtest / Content Strand. Verbal Reasoning 40 Questions Results by Question # of Q

Student Performance Analysis Report. Subtest / Content Strand. Verbal Reasoning 40 Questions Results by Question # of Q Student: BLAIR, SIERRA Test Date: 9/2018 (Fall) Grade: 7 Level: 6 Page: 1 Verbal Reasoning 40 Questions Results by Question # of Q Analogical Reasoning (Results presented in order of question difficulty)

More information