Filling the Gaps. Improving Wikipedia Stubs

1 Filling the Gaps: Improving Wikipedia Stubs
Siddhartha Banerjee and Prasenjit Mitra
The Pennsylvania State University, University Park, PA, USA
Qatar Computing Research Institute, Doha, Qatar
15th ACM SIGWEB International Symposium on Document Engineering (DocEng), 2015, Lausanne, Switzerland

2 Outline
- Motivation
- What is the research problem?
- Related work
- Proposed approach
- Dataset
- Experimental results
- Next steps
DocEng2015 2

3 Motivation
- Growth rate of Wikipedia: 700 new articles every day
- Still, a backlog of several missing articles
- Not enough authors
- Stubs: very short articles
- Computers can help write articles

4 Research Problem
- Improving stubs
- Identifying new relevant sections
- Appending content to new sections
- Acceptable content
- The problem of copyright violation
- Content from the web cannot be copied [Banerjee et al. (ICPR 2014)]

5 Related Work
- Sauper and Barzilay, ACL 2009: ILP-based approach to create Wikipedia articles automatically
- Ranking passages and populating Wikipedia
- But Wikipedia standards have evolved
- Banerjee et al., ICPR 2014: fixed-template approach
- Bot detects theatre play scripts on the web and writes Wikipedia articles about them
- Signpost, Wikipedia, January 2015
- Extractive summarization
- Only 2 out of 15 accepted
- Copyright violation!

6 Proposed Approach
- Categories on Wikipedia characterize articles surprisingly well
- Learn from the comprehensive articles in each category

7 Classifiers
- Random Forest
- Features using topic modelling: LDA (Blei et al., 2003)
- Varied number of topics (10-100)
- Deep Belief Network
- Pre-trained word embeddings
- Document representation: paragraph vectors (Le and Mikolov, 2014)
- Classes: top 10 sections in a category, based on frequency
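
The class-selection step on this slide (top 10 sections in a category, by frequency) can be sketched as follows. This is only an illustration, not the authors' code: the function name `top_sections`, the input layout (one list of section titles per article), and the choice to count a title at most once per article are all assumptions.

```python
from collections import Counter


def top_sections(articles, k=10):
    """Return the k most frequent section titles in a category.

    `articles` is a list of lists of section titles, one list per article
    (hypothetical layout). Titles are lower-cased, and each title is counted
    at most once per article (an assumption, not stated on the slide).
    """
    counts = Counter()
    for sections in articles:
        counts.update({title.strip().lower() for title in sections})
    # Sort by descending frequency; break ties alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [title for title, _ in ranked[:k]]


# Toy category with three articles (made-up data).
category = [
    ["Symptoms", "Causes", "Treatment"],
    ["Symptoms", "Diagnosis", "Treatment"],
    ["Symptoms", "History"],
]
print(top_sections(category, k=2))  # ['symptoms', 'treatment']
```

The returned titles would then serve as the class labels for whichever classifier is trained on the category.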

8 Content Summarization
- Query formulation: stub title + keyphrases (introduction)
- Google Search
- Clean webpage
- Retrieve passages
- Classification: assign to sections, threshold >= 0.5
- LexRank [Erkan and Radev, 2004]: top 5 sentences
- Sentence compression, Clarke and Lapata (2008): drop/keep words using ILP formulation
- Original: The aim is to give councils some control over the future growth of second homes
- Output: The aim is to give councils some control over the growth of homes
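
Two steps of this pipeline, assigning classified passages to sections at a confidence threshold of 0.5 and then picking the top 5 sentences, can be sketched in plain Python. This is a simplified LexRank-style ranking under stated assumptions, not the authors' implementation: it uses plain bag-of-words cosine similarity instead of the idf-modified cosine of Erkan and Radev, ignores self-similarity, and uses a damping factor of 0.85. The function names and the `(text, section, confidence)` tuple layout are hypothetical.

```python
import math
import re
from collections import Counter


def assign_sections(passages, threshold=0.5):
    """Group passages by predicted section, keeping confidence >= threshold.

    `passages` is an iterable of (text, section, confidence) tuples
    (hypothetical layout for the classifier output).
    """
    assigned = {}
    for text, section, confidence in passages:
        if confidence >= threshold:
            assigned.setdefault(section, []).append(text)
    return assigned


def _bag_of_words(sentence):
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def _cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def lexrank(sentences, top_n=5, damping=0.85, iterations=50):
    """Return the top_n most central sentences, in their original order."""
    n = len(sentences)
    if n == 0:
        return []
    vectors = [_bag_of_words(s) for s in sentences]
    # Similarity graph without self-loops (a simplification).
    sim = [
        [0.0 if i == j else _cosine(vectors[i], vectors[j]) for j in range(n)]
        for i in range(n)
    ]
    # Row-normalise; sentences with no neighbours spread their mass uniformly.
    rows = []
    for row in sim:
        total = sum(row)
        rows.append([v / total for v in row] if total else [1.0 / n] * n)
    scores = [1.0 / n] * n
    for _ in range(iterations):  # power iteration
        scores = [
            (1 - damping) / n
            + damping * sum(scores[j] * rows[j][i] for j in range(n))
            for i in range(n)
        ]
    best = sorted(range(n), key=lambda i: -scores[i])[:top_n]
    return [sentences[i] for i in sorted(best)]


# Made-up classifier output for one stub.
passages = [
    ("Malaria is treated with antimalarial drugs.", "Treatment", 0.9),
    ("Antimalarial drugs treat malaria effectively.", "Treatment", 0.7),
    ("Treatment of malaria uses antimalarial drugs.", "Treatment", 0.6),
    ("The weather was sunny yesterday.", "Treatment", 0.55),
    ("Mosquito nets are sold online.", "Prevention", 0.3),
]
by_section = assign_sections(passages)
summary = lexrank(by_section["Treatment"], top_n=2)
print(summary)
```

In this toy run the low-confidence passage is dropped before ranking, and the off-topic sentence shares no words with the others, so it receives the lowest centrality score and is left out of the summary.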

9 Experiments
- Dataset: limited to one category of articles on diseases
- 6000 articles and 560 stubs
- Classes: Prevention, Treatment, Prognosis, Pathophysiology, Classification, Causes, Diagnosis, History, Epidemiology, and Symptoms
- LDA features: MALLET topic modelling tool
- Classification F-scores (10-fold CV): best scores using 40 topics
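
For readers unfamiliar with the evaluation protocol named above, here is a minimal sketch of 10-fold splitting and a per-class F-score. It is illustrative only: the contiguous, unshuffled fold layout and the helper names are assumptions, and the labels in the usage lines are made-up data, not results from the talk.

```python
def k_fold_indices(n_items, k=10):
    """Yield (train, test) index lists for k roughly equal contiguous folds."""
    sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        yield train, test
        start += size


def f_score(gold, predicted, label):
    """F1 for one class: harmonic mean of precision and recall."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up gold and predicted section labels for four passages.
gold = ["Treatment", "Treatment", "Causes", "Causes"]
predicted = ["Treatment", "Causes", "Causes", "Causes"]
print(f_score(gold, predicted, "Treatment"))
print(f_score(gold, predicted, "Causes"))

folds = list(k_fold_indices(25, k=10))
print(len(folds))
```

In a full evaluation, a classifier would be trained on each `train` split and scored on the matching `test` split, with the per-class F-scores averaged over the ten folds.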

10 Content Generation Evaluation
- Reconstructing existing articles (80 articles)
- Appending to stubs

11 Example
- A starting point for authors!

12 Conclusions and Future Work
- Built a classifier to classify and assign content into appropriate sections
- Significantly lowered rejection rate on Wikipedia
- Minor cases of copyright violation were handled by manual editing
- Long paper in ACL 2015, Beijing, China: "WikiKreator: Improving Wikipedia Stubs Automatically"
- Developed a novel abstractive summarization technique for summary generation
- Articles can belong to multiple categories
- Bottom-up approach
- Paraphrasing

WikiKreator: Improving Wikipedia Stubs Automatically Siddhartha Banerjee The Pennsylvania State University Information Sciences and Technology University Park, PA, USA sub253@ist.psu.edu Prasenjit Mitra