Analyzing a Large Corpus of Crowdsourced Plagiarism
|
|
- Tyrone Terry
- 5 years ago
- Views:
Transcription
1 Analyzing a Large Corpus of Crowdsourced Plagiarism Master s Thesis Michael Völske Fakultät Medien Bauhaus-Universität Weimar Supervised by: Prof. Dr. Benno Stein Prof. Dr. Volker Rodehorst Advisors: Dr. Steven Burrows Dr. Matthias Hagen Dr. Martin Potthast
2 Analyzing a Large Corpus of Crowdsourced Plagiarism Master s Thesis Outline Introduction The Webis-TRC-12 Dataset Categorizing Crowdsourced Text Reuse Search Missions For Source Retrieval Summary
3 Introduction Modeling Text Reuse From the Web Search Copy & Paste Paraphrase Search I m Feeling Lucky [Potthast 2011] 3 [ ] Michael Völske 2013
4 Introduction Previous Plagiarism Corpora Source Selection Automatic Excerpt Automatic Modification Insertion Automatically generated artificial plagiarism PAN-PC-10/11: > documents; > plagiarism cases Automatic transformations do not preserve semantics 4 [ ] Michael Völske 2013
5 The Webis-TRC-12 Dataset 5 [ ] Michael Völske 2013
6 The Webis-TRC-12 Dataset Construction Overview Topic Author Editor ChatNoir SE ClueWeb API Revision log Query log ClueWeb 6 [ ] Michael Völske 2013
7 The Webis-TRC-12 Dataset Construction Overview: Topics Topic Author Editor ChatNoir SE ClueWeb API Revision log Query log ClueWeb 7 [ ] Michael Völske 2013
8 The Webis-TRC-12 Dataset Construction Overview: Topics Example topic: Obama s family. Write about President Barack Obama s family history, including genealogy, national origins, places and dates of birth, etc. Where did Barack Obama s parents and grandparents come from? Also include a brief biography of Obama s mother. Based on TREC Web Track topics topics, 297 essays Target essay length: 5000 words 8 [ ] Michael Völske 2013
9 The Webis-TRC-12 Dataset Construction Overview: Authors Topic Author Editor ChatNoir SE ClueWeb API Revision log Query log ClueWeb 9 [ ] Michael Völske 2013
10 The Webis-TRC-12 Dataset Construction Overview: Authors Number of Essays Author ID Crowdsourcing: 27 total Professional writers hired on odesk + volunteers Fluent English speakers 10 [ ] Michael Völske 2013
11 The Webis-TRC-12 Dataset Construction Overview: Authors Number of Essays Author Demographics (n=12) Age (Median) 37 Years Writing (Median) 8 Academic degree Postgrad 33% Undergrad 25% English Native 67% Second Language 33% Author ID Crowdsourcing: 27 total Professional writers hired on odesk + volunteers Fluent English speakers 11 [ ] Michael Völske 2013
12 The Webis-TRC-12 Dataset Construction Overview: Sources Topic Author Editor ChatNoir SE ClueWeb API Revision log Query log ClueWeb 12 [ ] Michael Völske 2013
13 The Webis-TRC-12 Dataset Construction Overview: Sources ClueWeb09: 500 million English pages Representative sample of the web Commonly used in search engine evaluation (TREC) 13 [ ] Michael Völske 2013
14 The Webis-TRC-12 Dataset Construction Overview: Search Engine Topic Author Editor ChatNoir SE ClueWeb API Revision log Query log ClueWeb 14 [ ] Michael Völske 2013
15 The Webis-TRC-12 Dataset Construction Overview: Search Engine [Potthast et al. 2012a] Used for source retrieval Indexes ClueWeb Records fine-grained interaction log [example] 15 [ ] Michael Völske 2013
16 The Webis-TRC-12 Dataset Construction Overview: Editor Topic Author Editor ChatNoir SE ClueWeb API Revision log Query log ClueWeb 16 [ ] Michael Völske 2013
17 The Webis-TRC-12 Dataset Construction Overview: Editor [Potthast et al. 2012b] 17 [ ] Michael Völske 2013
18 The Webis-TRC-12 Dataset Construction Overview: Editor [Potthast et al. 2012b] Custom web-based rich text editor Records sources of re-used text passages New revision every 300ms of inactivity Detailed revision history [example] 18 [ ] Michael Völske 2013
19 The Webis-TRC-12 Dataset Three Main Data Sources Topic Author Editor ChatNoir SE ClueWeb API Revision log Query log ClueWeb 19 [ ] Michael Völske 2013
20 The Webis-TRC-12 Dataset Research Questions 20 [ ] Michael Völske 2013
21 The Webis-TRC-12 Dataset Research Questions 1. Different text reuse approaches distinguishable? 2. Relation to existing plagiarism categorizations? 3. Influence of text reuse task on search engine interaction? 4. Detectable via query log analysis? We expect new research insights and impacts to text reuse detection query formulation paraphrasing 21 [ ] Michael Völske 2013
22 The Webis-TRC-12 Dataset Research Questions 1. Different text reuse approaches distinguishable? 2. Relation to existing plagiarism categorizations? 3. Influence of text reuse task on search engine interaction? 4. Detectable via query log analysis? We expect new research insights and impacts to text reuse detection query formulation paraphrasing 22 [ ] Michael Völske 2013
23 The Webis-TRC-12 Dataset Research Questions 1. Different text reuse approaches distinguishable? 2. Relation to existing plagiarism categorizations? 3. Influence of text reuse task on search engine interaction? 4. Detectable via query log analysis? We expect new research insights and impacts to text reuse detection query formulation paraphrasing 23 [ ] Michael Völske 2013
24 Categorizing Crowdsourced Text Reuse 24 [ ] Michael Völske 2013
25 Categorizing Crowdsourced Text Reuse Build-Up Versus Boil-Down Reuse Build-up reuse (left) versus boil-down reuse (right). text length (y-axis) over text revision (x-axis) colors: different source documents (original text is white) blue dots: position of the writer s last edit Build-up: 45%; boil-down: 40%; mixed: 12% 25 [ ] Michael Völske 2013
26 Categorizing Crowdsourced Text Reuse Build-Up Versus Boil-Down Reuse Build-up reuse (left) versus boil-down reuse (right). text length (y-axis) over text revision (x-axis) colors: different source documents (original text is white) blue dots: position of the writer s last edit Build-up: 45%; boil-down: 40%; mixed: 12% 26 [ ] Michael Völske 2013
27 Categorizing Crowdsourced Text Reuse Build-Up Versus Boil-Down Reuse Author 6 (12 topics) Author 20 (9 topics) Author 21 (21 topics) Text length Text length Text length Edit number Edit number Edit number Build-up reuse: Averaged editing histories by authors. one author per plot gray lines: individual essays black line: average 27 [ ] Michael Völske 2013
28 Categorizing Crowdsourced Text Reuse Build-Up Versus Boil-Down Reuse Author 2 (66 topics) Author 7 (20 topics) Author 24 (27 topics) Text length Text length Text length Edit number Edit number Edit number Boil-down reuse: Averaged editing histories by authors. one author per plot gray lines: individual essays black line: average 28 [ ] Michael Völske 2013
29 Categorizing Crowdsourced Text Reuse Classification Scheme for Text Reuse Find-Replace Remix Clone, Ctrl-C Mashup Classification Scheme for Text Reuse. types of plagiarism as distinguished by Turnitin [Turnitin 2012] interpret text reuse (plagiarism) as a combination of two factors: paraphrasing and interleaving 29 [ ] Michael Völske 2013
30 Categorizing Crowdsourced Text Reuse Classification Scheme for Text Reuse high Find-Replace Remix Paraphrasing low Clone, Ctrl-C Mashup low Interleaving high Classification Scheme for Text Reuse. types of plagiarism as distinguished by Turnitin [Turnitin 2012] interpret text reuse (plagiarism) as a combination of two factors: paraphrasing and interleaving 30 [ ] Michael Völske 2013
31 Categorizing Crowdsourced Text Reuse Classification Scheme for Text Reuse high Paraphrasing Find-Replace Remix Quantify: N-Gram similarity and ratio of passages to sources [details] Measure for all essays low Clone, Ctrl-C low Interleaving Mashup high Hypothesis: will show evidence of authors individual text reuse styles Classification Scheme for Text Reuse. types of plagiarism as distinguished by Turnitin [Turnitin 2012] interpret text reuse (plagiarism) as a combination of two factors: paraphrasing and interleaving 31 [ ] Michael Völske 2013
32 Categorizing Crowdsourced Text Reuse Classification Scheme for Text Reuse high Find-Replace Remix Paraphrasing Clone, Ctrl-C Mashup low low Interleaving high Median 5-gram passage similarity (paraphrasing) Passages per source (interleaving) Author (topic count) A0 (2) A14 (1) A1 (3) A15 (3) A2 (66) A16 (1) A3 (1) A17 (33) A4 (1) A18 (32) A5 (37) A6 (11) A7 (20) A19 (7) A20 (8) A21 (21) A8 (1) A22 (1) A9 (1) A23 (1) A10 (12) A11 (1) A12 (1) A24 (27) A25 (1) A26 (1) A13 (1) Classification Scheme for Text Reuse. types of plagiarism as distinguished by Turnitin [Turnitin 2012] interpret text reuse (plagiarism) as a combination of two factors: paraphrasing and interleaving 32 [ ] Michael Völske 2013
33 Categorizing Crowdsourced Text Reuse Classification Scheme for Text Reuse Median 5-gram passage similarity (paraphrasing) Passages per source (interleaving) Author (essay count) A0 (2) A14 (1) A1 (3) A15 (3) A2 (66) A16 (1) A3 (1) A17 (33) A4 (1) A18 (32) A5 (37) A6 (11) A7 (20) A19 (7) A20 (8) A21 (21) A8 (1) A22 (1) A9 (1) A23 (1) A10 (12) A11 (1) A12 (1) A24 (27) A25 (1) A26 (1) A13 (1) 33 [ ] Michael Völske 2013
34 Categorizing Crowdsourced Text Reuse Classification Scheme for Text Reuse Median 5-gram passage similarity (paraphrasing) Passages per source (interleaving) Author (essay count) A0 (2) A14 (1) A1 (3) A15 (3) A2 (66) A16 (1) A3 (1) A17 (33) A4 (1) A18 (32) A5 (37) A6 (11) A7 (20) A19 (7) A20 (8) A21 (21) A8 (1) A22 (1) A9 (1) A23 (1) A10 (12) A11 (1) A12 (1) A24 (27) A25 (1) A26 (1) A13 (1) 34 [ ] Michael Völske 2013
35 Categorizing Crowdsourced Text Reuse Classification Scheme for Text Reuse Median 5-gram passage similarity (paraphrasing) Passages per source (interleaving) Author (essay count) A0 (2) A14 (1) A1 (3) A15 (3) A2 (66) A16 (1) A3 (1) A17 (33) A4 (1) A18 (32) A5 (37) A6 (11) A7 (20) A19 (7) A20 (8) A21 (21) A8 (1) A22 (1) A9 (1) A23 (1) A10 (12) A11 (1) A12 (1) A24 (27) A25 (1) A26 (1) A13 (1) 35 [ ] Michael Völske 2013
36 Categorizing Crowdsourced Text Reuse Classification Scheme for Text Reuse Median 5-gram passage similarity (paraphrasing) Passages per source (interleaving) Author (essay count) A0 (2) A14 (1) A1 (3) A15 (3) A2 (66) A16 (1) A3 (1) A17 (33) A4 (1) A18 (32) A5 (37) A6 (11) A7 (20) A19 (7) A20 (8) A21 (21) A8 (1) A22 (1) A9 (1) A23 (1) A10 (12) A11 (1) A12 (1) A24 (27) A25 (1) A26 (1) A13 (1) 36 [ ] Michael Völske 2013
37 Categorizing Crowdsourced Text Reuse Classification Scheme for Text Reuse Median 5-gram passage similarity (paraphrasing) Passages per source (interleaving) Author (essay count) A0 (2) A14 (1) A1 (3) A15 (3) A2 (66) A16 (1) A3 (1) A17 (33) A4 (1) A18 (32) A5 (37) A6 (11) A7 (20) A19 (7) A20 (8) A21 (21) A8 (1) A22 (1) A9 (1) A23 (1) A10 (12) A11 (1) A12 (1) A24 (27) A25 (1) A26 (1) A13 (1) 37 [ ] Michael Völske 2013
38 Search Missions For Source Retrieval 38 [ ] Michael Völske 2013
39 Search Missions For Source Retrieval Distribution of Queries Over Time A B C D E F Distribution of queries over time. fraction of posed queries (y-axis) over elapsed time (x-axis) between the first query until essay completion each cell represents one of 150 essays the numbers denote the total amount of posed queries the cells are sorted by area under the curve 39 [ ] Michael Völske 2013
40 Search Missions For Source Retrieval Distribution of Queries Over Time A B C D E F Distribution of queries over time. fraction of posed queries (y-axis) over elapsed time (x-axis) between the first query until essay completion each cell represents one of 150 essays the numbers denote the total amount of posed queries the cells are sorted by area under the curve 40 [ ] Michael Völske 2013
41 Search Missions For Source Retrieval Distribution of Queries Over Time Distribution of queries over time. fraction of posed queries (y-axis) over elapsed time (x-axis) between the first query until essay completion each cell represents one of 150 essays the numbers denote the total amount of posed queries the cells are sorted by area under the curve 41 [ ] Michael Völske 2013
42 Search Missions For Source Retrieval Correlation of Editing and Querying 1 Author 5 (18 topics) Author 20 (9 topics) Author 2 (33 topics) Author 24 (13 topics) Correlation of editing and querying behavior. averaged editing histories by authors [plots] distribution of queries over time [plots] 42 [ ] Michael Völske 2013
43 Summary 1. Novel quality of the Webis-TRC-12 dataset of crowdsourced text reuse 2. Evidence of two fundamental editing strategies: build-up & boil-down 3. New classification scheme for documents in a text reuse corpus 4. Relationship between editing behavior and search engine use Future Work 1. Interleaving and paraphrasing in the time dimension 2. Authors text reuse strategies across multiple documents 3. Paraphrasing study: track individual passages over time Thank you for your attention! 43 [ ] Michael Völske 2013
44 Summary 1. Novel quality of the Webis-TRC-12 dataset of crowdsourced text reuse 2. Evidence of two fundamental editing strategies: build-up & boil-down 3. New classification scheme for documents in a text reuse corpus 4. Relationship between editing behavior and search engine use Future Work 1. Interleaving and paraphrasing in the time dimension 2. Authors text reuse strategies across multiple documents 3. Paraphrasing study: track individual passages over time Thank you for your attention! 44 [ ] Michael Völske 2013
45 Summary 1. Novel quality of the Webis-TRC-12 dataset of crowdsourced text reuse 2. Evidence of two fundamental editing strategies: build-up & boil-down 3. New classification scheme for documents in a text reuse corpus 4. Relationship between editing behavior and search engine use Future Work 1. Interleaving and paraphrasing in the time dimension 2. Authors text reuse strategies across multiple documents 3. Paraphrasing study: track individual passages over time Thank you for your attention! 45 [ ] Michael Völske 2013
46
47 References [Potthast 2011] Martin Potthast. Technologies for Reusing Text from the Web. Dissertation, Bauhaus-Universität Weimar, December [Potthast et al. 2012a] Martin Potthast, Matthias Hagen, Benno Stein, Jan Graßegger, Maximilian Michel, Martin Tippmann, and Clement Welsch. ChatNoir: A Search Engine for the ClueWeb09 Corpus. SIGIR 2012, Portland, Oregon, August [Potthast et al. 2012b] Martin Potthast, Tim Gollub, Matthias Hagen, Jan Graßegger, Johannes Kiesel, Maximilian Michel, Arnd Oberländer, Martin Tippmann, Alberto Barrón-Cedeño, Parth Gupta, Paolo Rosso, and Benno Stein. Overview of the 4th International Competition on Plagiarism Detection. CLEF 2012 Evaluation Labs and Workshop Working Notes Papers, September [<]
48 Example topic: Obama s family. Write about President Barack Obama s family history, including genealogy, national origins, places and dates of birth, etc. Where did Barack Obama s parents and grandparents come from? Also include a brief biography of Obama s mother. [<]
49 Example topic: Obama s family. Write about President Barack Obama s family history, including genealogy, national origins, places and dates of birth, etc. Where did Barack Obama s parents and grandparents come from? Also include a brief biography of Obama s mother. Original topic 001 of the TREC Web Track 2009: Query. obama family tree Description. Find information on President Barack Obama s family history, including genealogy, national origins, places and dates of birth, etc. Sub-topic 1. Find the TIME magazine photo essay Barack Obama s Family Tree. Sub-topic 2. Where did Barack Obama s parents and grandparents come from? Sub-topic 3. Find biographical information on Barack Obama s mother. [<]
50 Type Rank Frequency Severity Clone 1 1 Exact copy of another author s work Mashup 2 3 A mix of material copied verbatim from several sources Ctrl-C 3 2 Significant portions of text copied from a single source Remix 4 9 Paraphrasing from several sources and making the content fit together seamlessly Recycle 5 5 Self-plagiarism Re-Tweet 6 10 Proper citation, but closely follows a single source Find-Replace 7 7 Near copy of a single source, with key phrases changed Aggregator 8 4 Proper citation, but (almost) no original work 404 Error 9 6 Citations to non-existent or inaccurate information about sources Hybrid 10 8 Combining properly cited sources with plagiarism in one paper [Turnitin 2012] [<]
51 Details: Paraphrasing & Interleaving Classification Scheme for Text Reuse high Find-Replace Remix How to quantify? Paraphrasing Measure at the passage level Passage: Block of text reused from the same source Paraphrasing: simple N-Gram similarity low Clone, Ctrl-C Mashup low Interleaving high Classification Scheme for Text Reuse. types of plagiarism as distinguished by Turnitin [Turnitin 2012] interpret text reuse (plagiarism) as a combination of two factors: paraphrasing and interleaving [<]
52 Details: Paraphrasing & Interleaving Classification Scheme for Text Reuse high Find-Replace Remix Three passages: Paraphrasing low Clone, Ctrl-C Mashup low Interleaving high Classification Scheme for Text Reuse. types of plagiarism as distinguished by Turnitin [Turnitin 2012] interpret text reuse (plagiarism) as a combination of two factors: paraphrasing and interleaving [<]
53 Details: Paraphrasing & Interleaving Classification Scheme for Text Reuse high Find-Replace Remix Paraphrasing: N-Gram similarity Paraphrasing low Clone, Ctrl-C Mashup low Interleaving high Classification Scheme for Text Reuse. types of plagiarism as distinguished by Turnitin [Turnitin 2012] interpret text reuse (plagiarism) as a combination of two factors: paraphrasing and interleaving [<]
54 Details: Paraphrasing & Interleaving Classification Scheme for Text Reuse high Find-Replace Remix Paraphrasing: N-Gram similarity Paraphrasing low Clone, Ctrl-C Mashup low Interleaving high Classification Scheme for Text Reuse. types of plagiarism as distinguished by Turnitin [Turnitin 2012] interpret text reuse (plagiarism) as a combination of two factors: paraphrasing and interleaving [<]
55 Details: Paraphrasing & Interleaving Classification Scheme for Text Reuse high Find-Replace Remix Paraphrasing: N-Gram similarity Paraphrasing low Clone, Ctrl-C Mashup low Interleaving high Classification Scheme for Text Reuse. types of plagiarism as distinguished by Turnitin [Turnitin 2012] interpret text reuse (plagiarism) as a combination of two factors: paraphrasing and interleaving [<]
56 Details: Paraphrasing & Interleaving Classification Scheme for Text Reuse high Paraphrasing Find-Replace Remix Paraphrasing: N-Gram similarity ϕ n (N c, N s ) := N c N s N c N c : N-Grams in the passage N s : N-Grams in the source low Clone, Ctrl-C Mashup We choose n = 5. low Interleaving high Classification Scheme for Text Reuse. types of plagiarism as distinguished by Turnitin [Turnitin 2012] interpret text reuse (plagiarism) as a combination of two factors: paraphrasing and interleaving [<]
57 Details: Paraphrasing & Interleaving Classification Scheme for Text Reuse high Paraphrasing Find-Replace Remix Interleaving: Passages per source C: passages S: sources pps(c, S) := C S low Clone, Ctrl-C Mashup low Interleaving high Classification Scheme for Text Reuse. types of plagiarism as distinguished by Turnitin [Turnitin 2012] interpret text reuse (plagiarism) as a combination of two factors: paraphrasing and interleaving [<]
Exploratory Search Missions for TREC Topics
Exploratory Search Missions for TREC Topics EuroHCIR 2013 Dublin, 1 August 2013 Martin Potthast Matthias Hagen Michael Völske Benno Stein Bauhaus-Universität Weimar www.webis.de Dataset for studying text
More informationOverview of the 5th International Competition on Plagiarism Detection
Overview of the 5th International Competition on Plagiarism Detection Martin Potthast, Matthias Hagen, Tim Gollub, Martin Tippmann, Johannes Kiesel, and Benno Stein Bauhaus-Universität Weimar www.webis.de
More informationCrowdsourcing Interaction Logs to Understand Text Reuse from the Web
In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1212-1221, August 2013. ACL. Crowdsourcing Interaction Logs to Understand Text Reuse
More informationImproving Synoptic Querying for Source Retrieval
Improving Synoptic Querying for Source Retrieval Notebook for PAN at CLEF 2015 Šimon Suchomel and Michal Brandejs Faculty of Informatics, Masaryk University {suchomel,brandejs}@fi.muni.cz Abstract Source
More informationOverview of the 6th International Competition on Plagiarism Detection
Overview of the 6th International Competition on Plagiarism Detection Martin Potthast, Matthias Hagen, Anna Beyer, Matthias Busse, Martin Tippmann, and Benno Stein Bauhaus-Universität Weimar www.webis.de
More informationDiverse Queries and Feature Type Selection for Plagiarism Discovery
Diverse Queries and Feature Type Selection for Plagiarism Discovery Notebook for PAN at CLEF 2013 Šimon Suchomel, Jan Kasprzak, and Michal Brandejs Faculty of Informatics, Masaryk University {suchomel,kas,brandejs}@fi.muni.cz
More informationExpanded N-Grams for Semantic Text Alignment
Expanded N-Grams for Semantic Text Alignment Notebook for PAN at CLEF 2014 Samira Abnar, Mostafa Dehghani, Hamed Zamani, and Azadeh Shakery School of ECE, College of Engineering, University of Tehran,
More informationWebis at the TREC 2012 Session Track
Webis at the TREC 2012 Session Track Extended Abstract for the Conference Notebook Matthias Hagen, Martin Potthast, Matthias Busse, Jakob Gomoll, Jannis Harder, and Benno Stein Bauhaus-Universität Weimar
More informationHashing and Merging Heuristics for Text Reuse Detection
Hashing and Merging Heuristics for Text Reuse Detection Notebook for PAN at CLEF-2014 Faisal Alvi 1, Mark Stevenson 2, Paul Clough 2 1 King Fahd University of Petroleum & Minerals, Saudi Arabia, 2 University
More informationPredicting Retrieval Success Based on Information Use for Writing Tasks
This is a post-print version of the article: Vakkari P., Völske M., Potthast M., Hagen M., Stein B. (2018) Predicting Retrieval Success Based on Information Use for Writing Tasks. In: Méndez E., Crestani
More informationSimilarity Overlap Metric and Greedy String Tiling at PAN 2012: Plagiarism Detection
Similarity Overlap Metric and Greedy String Tiling at PAN 2012: Plagiarism Detection Notebook for PAN at CLEF 2012 Arun kumar Jayapal The University of Sheffield, UK arunkumar.jeyapal@gmail.com Abstract
More informationSupervised Ranking for Plagiarism Source Retrieval
Supervised Ranking for Plagiarism Source Retrieval Notebook for PAN at CLEF 2013 Kyle Williams, Hung-Hsuan Chen, and C. Lee Giles, Information Sciences and Technology Computer Science and Engineering Pennsylvania
More informationSupporting Scholarly Search with Keyqueries
Supporting Scholarly Search with Keyqueries Matthias Hagen Anna Beyer Tim Gollub Kristof Komlossy Benno Stein Bauhaus-Universität Weimar matthias.hagen@uni-weimar.de @matthias_hagen ECIR 2016 Padova, Italy
More informationCandidate Document Retrieval for Web-scale Text Reuse Detection
Candidate Document Retrieval for Web-scale Text Reuse Detection Matthias Hagen Benno Stein Bauhaus-Universität Weimar matthias.hagen@uni-weimar.de SPIRE 2011 Pisa, Italy October 19, 2011 Matthias Hagen,
More informationNew Issues in Near-duplicate Detection
New Issues in Near-duplicate Detection Martin Potthast and Benno Stein Bauhaus University Weimar Web Technology and Information Systems Motivation About 30% of the Web is redundant. [Fetterly 03, Broder
More informationInterpreting Turnitin Originality Reports
Interpreting Turnitin Originality Reports This guide will provide a brief introduction to using and interpreting Originality Reports in Turnitin via Canvas. Turnitin (TII) Originality reports allow you
More informationThree way search engine queries with multi-feature document comparison for plagiarism detection
Three way search engine queries with multi-feature document comparison for plagiarism detection Notebook for PAN at CLEF 2012 Šimon Suchomel, Jan Kasprzak, and Michal Brandejs Faculty of Informatics, Masaryk
More informationThe Winning Approach to Text Alignment for Text Reuse Detection at PAN 2014
The Winning Approach to Text Alignment for Text Reuse Detection at PAN 2014 Notebook for PAN at CLEF 2014 Miguel A. Sanchez-Perez, Grigori Sidorov, Alexander Gelbukh Centro de Investigación en Computación,
More informationPlagiarism Detection. Lucia D. Krisnawati
Plagiarism Detection Lucia D. Krisnawati Overview Introduction to Automatic Plagiarism Detection (PD) External PD Intrinsic PD Evaluation Framework Have You Heard, Seen or Used These? The Available PD
More informationAuthor Masking through Translation
Author Masking through Translation Notebook for PAN at CLEF 2016 Yashwant Keswani, Harsh Trivedi, Parth Mehta, and Prasenjit Majumder Dhirubhai Ambani Institute of Information and Communication Technology
More informationDetection of Text Plagiarism and Wikipedia Vandalism
Detection of Text Plagiarism and Wikipedia Vandalism Benno Stein Bauhaus-Universität Weimar www.webis.de Keynote at SEPLN, Valencia, 8. Sep. 2010 The webis Group Benno Stein Dennis Hoppe Maik Anderka Martin
More informationAuthor Verification: Exploring a Large set of Parameters using a Genetic Algorithm
Author Verification: Exploring a Large set of Parameters using a Genetic Algorithm Notebook for PAN at CLEF 2014 Erwan Moreau 1, Arun Jayapal 2, and Carl Vogel 3 1 moreaue@cs.tcd.ie 2 jayapala@cs.tcd.ie
More informationClassifying and Ranking Search Engine Results as Potential Sources of Plagiarism
Classifying and Ranking Search Engine Results as Potential Sources of Plagiarism Kyle Williams, Hung-Hsuan Chen, C. Lee Giles Information Sciences and Technology, Computer Science and Engineering Pennsylvania
More informationDynamically Adjustable Approach through Obfuscation Type Recognition
Dynamically Adjustable Approach through Obfuscation Type Recognition Notebook for PAN at CLEF 2015 Miguel A. Sanchez-Perez, Alexander Gelbukh, and Grigori Sidorov Centro de Investigación en Computación,
More informationTE Teacher s Edition PE Pupil Edition Page 1
Standard 4 WRITING: Writing Process Students discuss, list, and graphically organize writing ideas. They write clear, coherent, and focused essays. Students progress through the stages of the writing process
More informationQuestion Answering Systems
Question Answering Systems An Introduction Potsdam, Germany, 14 July 2011 Saeedeh Momtazi Information Systems Group Outline 2 1 Introduction Outline 2 1 Introduction 2 History Outline 2 1 Introduction
More informationContextual Information Retrieval Using Ontology-Based User Profiles
Contextual Information Retrieval Using Ontology-Based User Profiles Vishnu Kanth Reddy Challam Master s Thesis Defense Date: Jan 22 nd, 2004. Committee Dr. Susan Gauch(Chair) Dr.David Andrews Dr. Jerzy
More information*Note: To find a complete list of sources, click on the List of Sources link in the top portion of every Biography Resource Center search screen.
Biography Resource Center Navigation Guide OVERVIEW The Biography Resource Center (BioRC) is a comprehensive database of biographical information on over 380,000 people from throughout history, around
More informationRishiraj Saha Roy and Niloy Ganguly IIT Kharagpur India. Monojit Choudhury and Srivatsan Laxman Microsoft Research India India
Rishiraj Saha Roy and Niloy Ganguly IIT Kharagpur India Monojit Choudhury and Srivatsan Laxman Microsoft Research India India ACM SIGIR 2012, Portland August 15, 2012 Dividing a query into individual semantic
More informationQuery Session Detection as a Cascade
Query Session Detection as a Cascade Extended Abstract Matthias Hagen, Benno Stein, and Tino Rüb Bauhaus-Universität Weimar .@uni-weimar.de Abstract We propose a cascading method
More informationNatural Language Processing. SoSe Question Answering
Natural Language Processing SoSe 2017 Question Answering Dr. Mariana Neves July 5th, 2017 Motivation Find small segments of text which answer users questions (http://start.csail.mit.edu/) 2 3 Motivation
More informationSacCT Creating a Turnitin Assignment How to Guide
SacCT Creating a Turnitin Assignment How to Guide HOW TO GUIDE WHAT IS A TURNITIN ASSIGNMENT CALIFORNIA STATE UNIVERSITY, SACRAMENTO Turnitin is an online plagiarism detection resource that is available
More informationOnline Compiler with Plagiarism Checker
Volume 118 No. 18 2018, 1547-1555 ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version) url: http://www.ijpam.eu ijpam.eu Online Compiler with Plagiarism Checker Aarthi G.V. 1, a) 2, b),,
More informationcode pattern analysis of object-oriented programming languages
code pattern analysis of object-oriented programming languages by Xubo Miao A thesis submitted to the School of Computing in conformity with the requirements for the degree of Master of Science Queen s
More informationData Mining. 3.5 Lazy Learners (Instance-Based Learners) Fall Instructor: Dr. Masoud Yaghini. Lazy Learners
Data Mining 3.5 (Instance-Based Learners) Fall 2008 Instructor: Dr. Masoud Yaghini Outline Introduction k-nearest-neighbor Classifiers References Introduction Introduction Lazy vs. eager learning Eager
More informationSocial Behavior Prediction Through Reality Mining
Social Behavior Prediction Through Reality Mining Charlie Dagli, William Campbell, Clifford Weinstein Human Language Technology Group MIT Lincoln Laboratory This work was sponsored by the DDR&E / RRTO
More informationResearch Question: Do people perform better on a math task after observing Impressionist art or modern art?
MEDIA LAB TUTORIAL Research Question: Do people perform better on a math task after observing Impressionist art or modern art? All of the graphics you will use should be stored in a folder that you will
More informationAdaptive Algorithm for Plagiarism Detection: The Best-performing Approach at PAN 2014 Text Alignment Competition
Adaptive Algorithm for Plagiarism Detection: The Best-performing Approach at PAN 2014 Text Alignment Competition Miguel A. Sanchez-Perez, Alexander Gelbukh, and Grigori Sidorov Centro de Investigación
More informationA Deep Relevance Matching Model for Ad-hoc Retrieval
A Deep Relevance Matching Model for Ad-hoc Retrieval Jiafeng Guo 1, Yixing Fan 1, Qingyao Ai 2, W. Bruce Croft 2 1 CAS Key Lab of Web Data Science and Technology, Institute of Computing Technology, Chinese
More informationMulti-label classification using rule-based classifier systems
Multi-label classification using rule-based classifier systems Shabnam Nazmi (PhD candidate) Department of electrical and computer engineering North Carolina A&T state university Advisor: Dr. A. Homaifar
More informationExploration of Fuzzy C Means Clustering Algorithm in External Plagiarism Detection System
Exploration of Fuzzy C Means Clustering Algorithm in External Plagiarism Detection System N. Riya Ravi, K. Vani and Deepa Gupta Abstract With the advent of World Wide Web, plagiarism has become a prime
More informationRepurposing Benchmark Corpora for Reconstructing Provenance
Repurposing Benchmark Corpora for Reconstructing Provenance Sara Magliacane, Paul Groth Department of Computer Science VU University Amsterdam s.magliacane@vu.nl, p.t.groth@vu.nl Abstract. Provenance is
More informationAlexander Calder Surrealist Sculptor. By Klara Schmitt
Alexander Calder Surrealist Sculptor By Klara Schmitt Table of Contents 2 Creative Brief...3 Target Audience...4 Content Outline...6 Navigation Map...7 Wireframes and Design Comps...8 Styles...34 Credits...35
More informationThe information here will help you avoid violating university policy about copying material written by others.
Blackboard and its plagiarism checking tool SafeAssign will fully retire throughout all components of UTHealth by fall 2016. At that point, all UTHealth courses will be taught in Canvas using Turnitin
More informationInternational Graduate School of Genetic and Molecular Epidemiology (GAME) Computing Notes and Introduction to Stata
International Graduate School of Genetic and Molecular Epidemiology (GAME) Computing Notes and Introduction to Stata Paul Dickman September 2003 1 A brief introduction to Stata Starting the Stata program
More informationAutomatically Generating Queries for Prior Art Search
Automatically Generating Queries for Prior Art Search Erik Graf, Leif Azzopardi, Keith van Rijsbergen University of Glasgow {graf,leif,keith}@dcs.gla.ac.uk Abstract This report outlines our participation
More informationDeep Learning based Authorship Identification
Deep Learning based Authorship Identification Chen Qian Tianchang He Rao Zhang Department of Electrical Engineering Stanford University, Stanford, CA 94305 cqian23@stanford.edu th7@stanford.edu zhangrao@stanford.edu
More informationA RECOMMENDER SYSTEM FOR SOCIAL BOOK SEARCH
A RECOMMENDER SYSTEM FOR SOCIAL BOOK SEARCH A thesis Submitted to the faculty of the graduate school of the University of Minnesota by Vamshi Krishna Thotempudi In partial fulfillment of the requirements
More informationA Study of MatchPyramid Models on Ad hoc Retrieval
A Study of MatchPyramid Models on Ad hoc Retrieval Liang Pang, Yanyan Lan, Jiafeng Guo, Jun Xu, Xueqi Cheng Institute of Computing Technology, Chinese Academy of Sciences Text Matching Many text based
More informationTurnitin Instructor Guide
Table of Contents Turnitin Overview... 1 Recommended Steps... 2 IMPORTANT! Do NOT Copy Turnitin Assignments... 2 Browser Requirements... 2 Creating a Turnitin Assignment... 2 Editing a Turnitin Assignment...
More informationEVALUATION AND IMPLEMENTATION OF N -GRAM-BASED ALGORITHM FOR FAST TEXT COMPARISON
Computing and Informatics, Vol. 36, 2017, 887 907, doi: 10.4149/cai 2017 4 887 EVALUATION AND IMPLEMENTATION OF N -GRAM-BASED ALGORITHM FOR FAST TEXT COMPARISON Maciej Wielgosz, Pawe l Szczepka, Pawe l
More informationInformativeness for Adhoc IR Evaluation:
Informativeness for Adhoc IR Evaluation: A measure that prevents assessing individual documents Romain Deveaud 1, Véronique Moriceau 2, Josiane Mothe 3, and Eric SanJuan 1 1 LIA, Univ. Avignon, France,
More informationFine-grained Software Version Control Based on a Program s Abstract Syntax Tree
Master Thesis Description and Schedule Fine-grained Software Version Control Based on a Program s Abstract Syntax Tree Martin Otth Supervisors: Prof. Dr. Peter Müller Dimitar Asenov Chair of Programming
More informationLab Benchmarking Symposium Introduction. Alison Farmer
Lab Benchmarking Symposium Introduction Alison Farmer Purpose of the Symposium Describe state of lab benchmarking today Introduce the I 2 SL Lab Benchmarking Working Group Reveal latest updates to Labs21
More informationThe Labs21 Benchmarking Tool
I 2 SL Tools for Sustainability Success: The Labs21 Benchmarking Tool Alison Farmer Outline Why we care about benchmarking The Labs21 Benchmarking Tool New upgrades by the I 2 SL Benchmarking Working Group
More informationNatural Language Processing SoSe Question Answering. (based on the slides of Dr. Saeedeh Momtazi)
Natural Language Processing SoSe 2015 Question Answering Dr. Mariana Neves July 6th, 2015 (based on the slides of Dr. Saeedeh Momtazi) Outline 2 Introduction History QA Architecture Outline 3 Introduction
More informationOverview of Classification Subtask at NTCIR-6 Patent Retrieval Task
Overview of Classification Subtask at NTCIR-6 Patent Retrieval Task Makoto Iwayama *, Atsushi Fujii, Noriko Kando * Hitachi, Ltd., 1-280 Higashi-koigakubo, Kokubunji, Tokyo 185-8601, Japan makoto.iwayama.nw@hitachi.com
More informationSemantic Estimation for Texts in Software Engineering
Semantic Estimation for Texts in Software Engineering 汇报人 : Reporter:Xiaochen Li Dalian University of Technology, China 大连理工大学 2016 年 11 月 29 日 Oscar Lab 2 Ph.D. candidate at OSCAR Lab, in Dalian University
More informationIntroduction to Microsoft Office PowerPoint 2010
Introduction to Microsoft Office PowerPoint 2010 TABLE OF CONTENTS Open PowerPoint 2010... 1 About the Editing Screen... 1 Create a Title Slide... 6 Save Your Presentation... 6 Create a New Slide... 7
More informationA Study on Query Expansion with MeSH Terms and Elasticsearch. IMS Unipd at CLEF ehealth Task 3
A Study on Query Expansion with MeSH Terms and Elasticsearch. IMS Unipd at CLEF ehealth Task 3 Giorgio Maria Di Nunzio and Alexandru Moldovan Dept. of Information Engineering University of Padua giorgiomaria.dinunzio@unipd.it,alexandru.moldovan@studenti.unipd.it
More informationScanning methods and language modeling for binary switch typing
Scanning methods and language modeling for binary switch typing Brian Roark, Jacques de Villiers, Christopher Gibbons and Melanie Fried-Oken Center for Spoken Language Understanding Child Development &
More informationMaster Project. Various Aspects of Recommender Systems. Prof. Dr. Georg Lausen Dr. Michael Färber Anas Alzoghbi Victor Anthony Arrascue Ayala
Master Project Various Aspects of Recommender Systems May 2nd, 2017 Master project SS17 Albert-Ludwigs-Universität Freiburg Prof. Dr. Georg Lausen Dr. Michael Färber Anas Alzoghbi Victor Anthony Arrascue
More informationQDA Miner. Addendum v2.0
QDA Miner Addendum v2.0 QDA Miner is an easy-to-use qualitative analysis software for coding, annotating, retrieving and reviewing coded data and documents such as open-ended responses, customer comments,
More informationOverview of the Patent Retrieval Task at the NTCIR-6 Workshop
Overview of the Patent Retrieval Task at the NTCIR-6 Workshop Atsushi Fujii, Makoto Iwayama, Noriko Kando Graduate School of Library, Information and Media Studies University of Tsukuba 1-2 Kasuga, Tsukuba,
More informationMMC 4936 Web GIS for Journalists
Florida International University FIU Digital Commons Course Syllabi Special Collections and University Archives Fall 2014 MMC 4936 Web GIS for Journalists Susan Jacobson Journalism and Mass Communications
More informationControl Invitation
Online Appendices Appendix A. Invitation Emails Control Invitation Email Subject: Reviewer Invitation from JPubE You are invited to review the above-mentioned manuscript for publication in the. The manuscript's
More informationTEXT CHAPTER 5. W. Bruce Croft BACKGROUND
41 CHAPTER 5 TEXT W. Bruce Croft BACKGROUND Much of the information in digital library or digital information organization applications is in the form of text. Even when the application focuses on multimedia
More informationNYU CSCI-GA Fall 2016
1 / 45 Information Retrieval: Personalization Fernando Diaz Microsoft Research NYC November 7, 2016 2 / 45 Outline Introduction to Personalization Topic-Specific PageRank News Personalization Deciding
More informationWriting Reports with Report Designer and SSRS 2014 Level 1
Writing Reports with Report Designer and SSRS 2014 Level 1 Duration- 2days About this course In this 2-day course, students are introduced to the foundations of report writing with Microsoft SQL Server
More informationGet the most value from your surveys with text analysis
SPSS Text Analysis for Surveys 3.0 Specifications Get the most value from your surveys with text analysis The words people use to answer a question tell you a lot about what they think and feel. That s
More informationSupplementary Material
Supplementary Material Figure 1S: Scree plot of the 400 dimensional data. The Figure shows the 20 largest eigenvalues of the (normalized) correlation matrix sorted in decreasing order; the insert shows
More informationNVIvo 11 workshop. Hui Bian Office for Faculty Excellence
1 NVIvo 11 workshop Hui Bian Office for Faculty Excellence Contact information 2 Email: bianh@ecu.edu Phone: 328-5428 1001 Joyner Library, room 1006 Website: http://core.ecu.edu/ofe/st atisticsresearch/
More informationA Language Independent Author Verifier Using Fuzzy C-Means Clustering
A Language Independent Author Verifier Using Fuzzy C-Means Clustering Notebook for PAN at CLEF 2014 Pashutan Modaresi 1,2 and Philipp Gross 1 1 pressrelations GmbH, Düsseldorf, Germany {pashutan.modaresi,
More informationGeometric Techniques. Part 1. Example: Scatter Plot. Basic Idea: Scatterplots. Basic Idea. House data: Price and Number of bedrooms
Part 1 Geometric Techniques Scatterplots, Parallel Coordinates,... Geometric Techniques Basic Idea Visualization of Geometric Transformations and Projections of the Data Scatterplots [Cleveland 1993] Parallel
More informationNick Terkay CSCI 7818 Web Services 11/16/2006
Nick Terkay CSCI 7818 Web Services 11/16/2006 Ning? Start-up co-founded by Marc Andreeson, the co- founder of Netscape. October 2005 Ning is an online platform for painlessly creating web apps in a jiffy.
More informationResearch Article. August 2017
International Journals of Advanced Research in Computer Science and Software Engineering ISSN: 2277-128X (Volume-7, Issue-8) a Research Article August 2017 English-Marathi Cross Language Information Retrieval
More informationScientific Graphing in Excel 2013
Scientific Graphing in Excel 2013 When you start Excel, you will see the screen below. Various parts of the display are labelled in red, with arrows, to define the terms used in the remainder of this overview.
More informationLearning to Reweight Terms with Distributed Representations
Learning to Reweight Terms with Distributed Representations School of Computer Science Carnegie Mellon University August 12, 215 Outline Goal: Assign weights to query terms for better retrieval results
More informationE-Marefa User Guide. "Arab Theses and Dissertations"
E-Marefa User Guide "Arab Theses and Dissertations" Table of Contents What is E-Marefa Database.3 System Requirements 3 Inside this User Guide 3 Access to E-Marefa Database.....4 Choosing Database to Search.5
More informationMDSWriter: Annotation tool for creating high-quality. multi-document summarization corpora.
MDSWriter: Annotation tool for creating high-quality multi-document summarization corpora Christian M. Meyer, Darina Benikova, Margot Mieskes, and Iryna Gurevych Research Training Group AIPHES Ubiquitous
More informationQueryLines: Approximate Query for Visual Browsing
MITSUBISHI ELECTRIC RESEARCH LABORATORIES http://www.merl.com QueryLines: Approximate Query for Visual Browsing Kathy Ryall, Neal Lesh, Tom Lanning, Darren Leigh, Hiroaki Miyashita and Shigeru Makino TR2005-015
More informationInformation Search in Web Archives
Information Search in Web Archives Miguel Costa Advisor: Prof. Mário J. Silva Co-Advisor: Prof. Francisco Couto Department of Informatics, Faculty of Sciences, University of Lisbon PhD thesis defense,
More informationIntroduction April 27 th 2016
Social Web Mining Summer Term 2016 1 Introduction April 27 th 2016 Dr. Darko Obradovic Insiders Technologies GmbH Kaiserslautern d.obradovic@insiders-technologies.de Outline for Today 1.1 1.2 1.3 1.4 1.5
More informationLab Activity #2- Statistics and Graphing
Lab Activity #2- Statistics and Graphing Graphical Representation of Data and the Use of Google Sheets : Scientists answer posed questions by performing experiments which provide information about a given
More informationDeep Character-Level Click-Through Rate Prediction for Sponsored Search
Deep Character-Level Click-Through Rate Prediction for Sponsored Search Bora Edizel - Phd Student UPF Amin Mantrach - Criteo Research Xiao Bai - Oath This work was done at Yahoo and will be presented as
More informationQuery Heartbeat: A Strange Property of Keyword Queries on the Web
Query Heartbeat: A Strange Property of Keyword Queries on the Web Karthik.B.R Aditya Ramana Rachakonda Dr. Srinath Srinivasa International Institute of Information Technology, Bangalore. December 16, 2008
More informationPractice Questions. AGENDA: Brief Library Overview + APA style ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
RESEARCH SKILLS INSTRUCTION - Library Class Outline Centennial Libraries homepage http://library.centennialcollege.ca/ - Liz Dobson & Eric Wilding COMM 170-28 & 29 October 2015 2015 Sept AGENDA: Brief
More informationExtracting Information from Social Networks
Extracting Information from Social Networks Reminder: Social networks Catch-all term for social networking sites Facebook microblogging sites Twitter blog sites (for some purposes) 1 2 Ways we can use
More informationAggregates Geometric Parameters
Aggregates Geometric Parameters On-Line or Lab Measurement of Size and Shapes by 3D Image Analysis Terry Stauffer Application Note SL-AN-48 Revision A Provided By: Microtrac Particle Characterization Solutions
More informationEntity and Knowledge Base-oriented Information Retrieval
Entity and Knowledge Base-oriented Information Retrieval Presenter: Liuqing Li liuqing@vt.edu Digital Library Research Laboratory Virginia Polytechnic Institute and State University Blacksburg, VA 24061
More informationThis research aims to present a new way of visualizing multi-dimensional data using generalized scatterplots by sensitivity coefficients to highlight
This research aims to present a new way of visualizing multi-dimensional data using generalized scatterplots by sensitivity coefficients to highlight local variation of one variable with respect to another.
More informationEXCEL SPREADSHEET TUTORIAL
EXCEL SPREADSHEET TUTORIAL Note to all 200 level physics students: You will be expected to properly format data tables and graphs in all lab reports, as described in this tutorial. Therefore, you are responsible
More informationDELL EMC UNITY: DATA REDUCTION
DELL EMC UNITY: DATA REDUCTION Overview ABSTRACT This white paper is an introduction to the Dell EMC Unity Data Reduction feature. It provides an overview of the feature, methods for managing data reduction,
More informationIllinois State University Events Calendar
Illinois State University Events Calendar Adding an Event To add an event to the Illinois State University Events Calendar, visit Events.IllinoisState.edu and scroll to the footer to login. Click the Copyright
More informationTREC-7 Experiments at the University of Maryland Douglas W. Oard Digital Library Research Group College of Library and Information Services University
TREC-7 Experiments at the University of Maryland Douglas W. Oard Digital Library Research Group College of Library and Information Services University of Maryland, College Park, MD 20742 oard@glue.umd.edu
More informationHOLY SPIRIT UNIVERSITY OF KASLIK LIBRARY HELP GUIDES. USEK. Student Manual: Detecting and Deterring plagiarism in student work
HOLY SPIRIT UNIVERSITY OF KASLIK LIBRARY HELP GUIDES Turnitin @ USEK Student Manual: Detecting and Deterring plagiarism in student work Updated September 14, 2018 OUTLINE 1. Definition of plagiarism 1
More informationTilburg University. Authoritative re-ranking of search results Bogers, A.M.; van den Bosch, A. Published in: Advances in Information Retrieval
Tilburg University Authoritative re-ranking of search results Bogers, A.M.; van den Bosch, A. Published in: Advances in Information Retrieval Publication date: 2006 Link to publication Citation for published
More informationQuery Session Detection as a Cascade
Query Session Detection as a Cascade Matthias Hagen Benno Stein Tino Rüb Bauhaus-Universität Weimar matthias.hagen@uni-weimar.de CIKM 2011 Glasgow, Scotland October 25, 2011 Hagen, Stein, Rüb Query Session
More informationCanvas Student Guide. The Office of Online Learning Massasoit Community College
Canvas Student Guide The Office of Online Learning Massasoit Community College www.massasoit.edu TABLE OF CONTENTS What is Canvas?... 1 Computer and Browser Requirements... 1 Mobile Support... 1 Accessing
More informationResPubliQA 2010
SZTAKI @ ResPubliQA 2010 David Mark Nemeskey Computer and Automation Research Institute, Hungarian Academy of Sciences, Budapest, Hungary (SZTAKI) Abstract. This paper summarizes the results of our first
More information