Analyzing a Large Corpus of Crowdsourced Plagiarism


1 Analyzing a Large Corpus of Crowdsourced Plagiarism. Master's Thesis. Michael Völske, Fakultät Medien, Bauhaus-Universität Weimar. Supervised by: Prof. Dr. Benno Stein, Prof. Dr. Volker Rodehorst. Advisors: Dr. Steven Burrows, Dr. Matthias Hagen, Dr. Martin Potthast.

2 Outline: Introduction; The Webis-TRC-12 Dataset; Categorizing Crowdsourced Text Reuse; Search Missions For Source Retrieval; Summary.

3 Introduction. Modeling Text Reuse From the Web. [Figure labels: Search, Copy & Paste, Paraphrase; Search, I'm Feeling Lucky] [Potthast 2011] Michael Völske 2013

4 Introduction. Previous Plagiarism Corpora. [Figure labels: Source Selection Automatic Excerpt Automatic Modification Insertion] Automatically generated artificial plagiarism. PAN-PC-10/11: > documents; > plagiarism cases. Automatic transformations do not preserve semantics.

5 The Webis-TRC-12 Dataset

6 The Webis-TRC-12 Dataset. Construction Overview. [Diagram components: Topic, Author, Editor, ChatNoir SE, ClueWeb API, Revision log, Query log, ClueWeb]

8 The Webis-TRC-12 Dataset. Construction Overview: Topics. Example topic: Obama's family. Write about President Barack Obama's family history, including genealogy, national origins, places and dates of birth, etc. Where did Barack Obama's parents and grandparents come from? Also include a brief biography of Obama's mother. Based on TREC Web Track topics; 297 essays. Target essay length: 5000 words.

11 The Webis-TRC-12 Dataset. Construction Overview: Authors. [Bar chart: number of essays per author ID] Crowdsourcing: 27 total. Professional writers hired on oDesk + volunteers. Fluent English speakers. Author demographics (n=12): age (median) 37; years writing (median) 8; academic degree: postgrad 33%, undergrad 25%; English: native 67%, second language 33%.

13 The Webis-TRC-12 Dataset. Construction Overview: Sources. ClueWeb09: 500 million English pages. Representative sample of the web. Commonly used in search engine evaluation (TREC).

15 The Webis-TRC-12 Dataset. Construction Overview: Search Engine [Potthast et al. 2012a]. Used for source retrieval. Indexes ClueWeb. Records a fine-grained interaction log [example].

18 The Webis-TRC-12 Dataset. Construction Overview: Editor [Potthast et al. 2012b]. Custom web-based rich text editor. Records sources of reused text passages. New revision every 300 ms of inactivity. Detailed revision history [example].
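
The inactivity rule for revisions can be illustrated with a minimal sketch. It assumes edit events arrive as sorted millisecond timestamps; the event format and the function name `split_into_revisions` are hypothetical, since the slides do not describe how the editor is implemented.

```python
from typing import List

INACTIVITY_MS = 300  # pause length named on the slide


def split_into_revisions(event_times_ms: List[int],
                         gap_ms: int = INACTIVITY_MS) -> List[List[int]]:
    """Group sorted edit-event timestamps into revisions.

    Assumption: a pause of at least gap_ms between two consecutive
    events closes the current revision and starts a new one.
    """
    revisions: List[List[int]] = []
    current: List[int] = []
    for t in event_times_ms:
        if current and t - current[-1] >= gap_ms:
            revisions.append(current)
            current = []
        current.append(t)
    if current:
        revisions.append(current)
    return revisions
```

With this grouping, a burst of typing forms a single revision, and each pause of 300 ms or more starts the next one.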

19 The Webis-TRC-12 Dataset. Three Main Data Sources. [Diagram components: Topic, Author, Editor, ChatNoir SE, ClueWeb API, Revision log, Query log, ClueWeb]

21 The Webis-TRC-12 Dataset. Research Questions. 1. Are different text reuse approaches distinguishable? 2. How do they relate to existing plagiarism categorizations? 3. How does the text reuse task influence search engine interaction? 4. Is it detectable via query log analysis? We expect new research insights and an impact on text reuse detection, query formulation, and paraphrasing.

24 Categorizing Crowdsourced Text Reuse

25 Categorizing Crowdsourced Text Reuse. Build-Up Versus Boil-Down Reuse. [Figure: build-up reuse (left) versus boil-down reuse (right); text length (y-axis) over text revision (x-axis); colors: different source documents (original text is white); blue dots: position of the writer's last edit.] Build-up: 45%; boil-down: 40%; mixed: 12%.
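
One way to tell the two shapes apart from a text-length-per-revision series is sketched below. This is a rough heuristic under stated assumptions, not the procedure used in the thesis: the slides do not say how essays were labeled, and both the function name `classify_editing_strategy` and the `shrink_ratio` threshold are hypothetical.

```python
from typing import Sequence


def classify_editing_strategy(lengths: Sequence[int],
                              shrink_ratio: float = 0.8) -> str:
    """Label a text-length series as 'build-up' or 'boil-down'.

    Heuristic (an assumption): if the essay once peaked well above its
    final length, material was collected first and then cut down.
    Otherwise the text is treated as having grown towards its final size.
    """
    if not lengths:
        raise ValueError("need at least one revision")
    peak = max(lengths)
    final = lengths[-1]
    if peak > 0 and final < shrink_ratio * peak:
        return "boil-down"
    return "build-up"
```

A mixed category would need a further rule, for example several distinct peaks, which this sketch does not attempt.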


27 Categorizing Crowdsourced Text Reuse. Build-Up Versus Boil-Down Reuse. [Figure: build-up reuse, averaged editing histories by authors; panels: Author 6 (12 topics), Author 20 (9 topics), Author 21 (21 topics); text length (y-axis) over edit number (x-axis); one author per plot; gray lines: individual essays; black line: average.]

28 Categorizing Crowdsourced Text Reuse. Build-Up Versus Boil-Down Reuse. [Figure: boil-down reuse, averaged editing histories by authors; panels: Author 2 (66 topics), Author 7 (20 topics), Author 24 (27 topics); text length (y-axis) over edit number (x-axis); one author per plot; gray lines: individual essays; black line: average.]

31 Categorizing Crowdsourced Text Reuse. Classification Scheme for Text Reuse. Types of plagiarism as distinguished by Turnitin [Turnitin 2012]; interpret text reuse (plagiarism) as a combination of two factors: paraphrasing and interleaving. [Figure: 2x2 grid. High paraphrasing: Find-Replace (low interleaving), Remix (high interleaving). Low paraphrasing: Clone, Ctrl-C (low interleaving), Mashup (high interleaving).] Quantify: N-gram similarity and ratio of passages to sources [details]. Measure for all essays. Hypothesis: will show evidence of authors' individual text reuse styles.
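
The two-factor scheme can be read as a simple lookup once both measures are available. The sketch below assumes that a low N-gram similarity to the source indicates heavy paraphrasing. Both cut-off values and the function name `reuse_quadrant` are hypothetical; the slides give no thresholds.

```python
def reuse_quadrant(similarity: float,
                   passages_per_source: float,
                   sim_threshold: float = 0.5,
                   pps_threshold: float = 2.0) -> str:
    """Map the two measures to a quadrant of the classification scheme.

    Assumptions: similarity lies in [0, 1] and low similarity means
    high paraphrasing; both thresholds are illustrative only.
    """
    paraphrasing_high = similarity < sim_threshold
    interleaving_high = passages_per_source > pps_threshold
    if paraphrasing_high:
        return "Remix" if interleaving_high else "Find-Replace"
    return "Mashup" if interleaving_high else "Clone / Ctrl-C"
```

For example, an essay with nearly verbatim passages drawn alternately from many sources would land in the Mashup quadrant.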

32 Categorizing Crowdsourced Text Reuse. Classification Scheme for Text Reuse. [Plot axes: median 5-gram passage similarity (paraphrasing) and passages per source (interleaving). Legend, author (essay count): A0 (2), A1 (3), A2 (66), A3 (1), A4 (1), A5 (37), A6 (11), A7 (20), A8 (1), A9 (1), A10 (12), A11 (1), A12 (1), A13 (1), A14 (1), A15 (3), A16 (1), A17 (33), A18 (32), A19 (7), A20 (8), A21 (21), A22 (1), A23 (1), A24 (27), A25 (1), A26 (1).]

38 Search Missions For Source Retrieval

39 Search Missions For Source Retrieval. Distribution of Queries Over Time. [Figure, panels A to F: fraction of posed queries (y-axis) over elapsed time (x-axis) from the first query until essay completion; each cell represents one of 150 essays; the numbers denote the total number of posed queries; the cells are sorted by area under the curve.]
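
The curve in each cell and the sort key can be sketched as follows. The sketch assumes query timestamps and the essay completion time are plain numbers in the same unit, and it treats the curve as a step function. The names `query_curve` and `area_under_curve` are hypothetical, and the thesis may compute the area differently.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def query_curve(query_times: Sequence[float], end_time: float) -> List[Point]:
    """Return (elapsed-time fraction, fraction of queries posed) points.

    Time runs from the first query (0.0) to essay completion (1.0).
    """
    if not query_times:
        raise ValueError("at least one query is required")
    times = sorted(query_times)
    start = times[0]
    span = end_time - start
    if span <= 0:
        raise ValueError("end_time must lie after the first query")
    total = len(times)
    return [((t - start) / span, (i + 1) / total) for i, t in enumerate(times)]


def area_under_curve(points: Sequence[Point]) -> float:
    """Area under the step curve on [0, 1]; each value holds until the next query."""
    area = 0.0
    for (x0, y0), (x1, _) in zip(points, points[1:]):
        area += (x1 - x0) * y0
    last_x, last_y = points[-1]
    area += (1.0 - last_x) * last_y
    return area
```

Under this reading, an essay whose queries are all posed early has an area close to 1, while one whose queries are spread evenly sits well below that.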


42 Search Missions For Source Retrieval. Correlation of Editing and Querying. [Figure: correlation of editing and querying behavior; panels: Author 5 (18 topics), Author 20 (9 topics), Author 2 (33 topics), Author 24 (13 topics); averaged editing histories by authors [plots]; distribution of queries over time [plots].]

43 Summary. 1. Novel quality of the Webis-TRC-12 dataset of crowdsourced text reuse. 2. Evidence of two fundamental editing strategies: build-up & boil-down. 3. New classification scheme for documents in a text reuse corpus. 4. Relationship between editing behavior and search engine use. Future Work. 1. Interleaving and paraphrasing in the time dimension. 2. Authors' text reuse strategies across multiple documents. 3. Paraphrasing study: track individual passages over time. Thank you for your attention!


47 References
[Potthast 2011] Martin Potthast. Technologies for Reusing Text from the Web. Dissertation, Bauhaus-Universität Weimar, December.
[Potthast et al. 2012a] Martin Potthast, Matthias Hagen, Benno Stein, Jan Graßegger, Maximilian Michel, Martin Tippmann, and Clement Welsch. ChatNoir: A Search Engine for the ClueWeb09 Corpus. SIGIR 2012, Portland, Oregon, August.
[Potthast et al. 2012b] Martin Potthast, Tim Gollub, Matthias Hagen, Jan Graßegger, Johannes Kiesel, Maximilian Michel, Arnd Oberländer, Martin Tippmann, Alberto Barrón-Cedeño, Parth Gupta, Paolo Rosso, and Benno Stein. Overview of the 4th International Competition on Plagiarism Detection. CLEF 2012 Evaluation Labs and Workshop Working Notes Papers, September.

49 Example topic: Obama's family. Write about President Barack Obama's family history, including genealogy, national origins, places and dates of birth, etc. Where did Barack Obama's parents and grandparents come from? Also include a brief biography of Obama's mother.
Original topic 001 of the TREC Web Track 2009:
Query: obama family tree
Description: Find information on President Barack Obama's family history, including genealogy, national origins, places and dates of birth, etc.
Sub-topic 1: Find the TIME magazine photo essay "Barack Obama's Family Tree".
Sub-topic 2: Where did Barack Obama's parents and grandparents come from?
Sub-topic 3: Find biographical information on Barack Obama's mother.

50 Types of plagiarism with frequency and severity ranks [Turnitin 2012]

Type          Rank (frequency)  Rank (severity)  Description
Clone         1                 1                Exact copy of another author's work
Mashup        2                 3                A mix of material copied verbatim from several sources
Ctrl-C        3                 2                Significant portions of text copied from a single source
Remix         4                 9                Paraphrasing from several sources and making the content fit together seamlessly
Recycle       5                 5                Self-plagiarism
Re-Tweet      6                 10               Proper citation, but closely follows a single source
Find-Replace  7                 7                Near copy of a single source, with key phrases changed
Aggregator    8                 4                Proper citation, but (almost) no original work
404 Error     9                 6                Citations to non-existent or inaccurate information about sources
Hybrid        10                8                Combining properly cited sources with plagiarism in one paper

51 Details: Paraphrasing & Interleaving. How to quantify? Measure at the passage level. Passage: block of text reused from the same source. Paraphrasing: simple N-gram similarity.

52 Details: Paraphrasing & Interleaving. [Figure: example with three passages.]

56 Details: Paraphrasing & Interleaving. Paraphrasing: N-gram similarity, $\phi_n(N_c, N_s) := \frac{|N_c \cap N_s|}{|N_c|}$, where $N_c$ is the set of N-grams in the passage and $N_s$ is the set of N-grams in the source. We choose $n = 5$.
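
A minimal sketch of this measure follows, reading it as the share of the passage's N-grams that also occur in the source. Lowercasing and whitespace tokenization are assumptions, since the slides do not specify how N-grams are formed, and the function names are hypothetical.

```python
from typing import Set, Tuple


def word_ngrams(text: str, n: int = 5) -> Set[Tuple[str, ...]]:
    """Set of word N-grams; lowercased whitespace tokens (an assumption)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_similarity(passage: str, source: str, n: int = 5) -> float:
    """Fraction of the passage's N-grams that also appear in the source."""
    n_c = word_ngrams(passage, n)
    n_s = word_ngrams(source, n)
    if not n_c:
        return 0.0  # passage shorter than n words: no N-grams to compare
    return len(n_c & n_s) / len(n_c)
```

A verbatim copy scores 1.0, and the score drops as more of the passage's 5-grams are broken up by rewording.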

57 Details: Paraphrasing & Interleaving. Interleaving: passages per source, $\mathrm{pps}(C, S) := \frac{|C|}{|S|}$, where $C$ is the set of passages and $S$ is the set of sources.
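
The ratio can be sketched as below, assuming an essay is available as a sequence of source IDs in reading order, one per reused text segment. Adjacent segments from the same source are merged into one passage, following the definition of a passage as a block of text reused from the same source. The input format and the function name are hypothetical.

```python
from itertools import groupby
from typing import Hashable, Sequence


def passages_per_source(segment_sources: Sequence[Hashable]) -> float:
    """Number of passages divided by number of distinct sources.

    Consecutive segments with the same source ID count as one passage.
    """
    if not segment_sources:
        raise ValueError("essay contains no reused segments")
    passages = [source for source, _ in groupby(segment_sources)]
    return len(passages) / len(set(passages))
```

An essay that uses each source in a single block yields 1.0, while alternating between two sources as A, B, A yields 1.5.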
