Quality-Based Automatic Classification for Presentation Slides


Seongchan Kim, Wonchul Jung, Keejun Han, Jae-Gil Lee, and Mun Y. Yi
Dept. of Knowledge Service Engineering, KAIST, Korea
{sckim,wonchul.jung,keejun.han,jaegil,munyi}@kaist.ac.kr

Abstract. Computerized presentation slides have become essential for effective business meetings, classroom discussions, and even general events and occasions. With the exploding number of online resources and materials, locating slides of high quality is a daunting challenge. In this study, we present a new, comprehensive framework of information quality developed specifically for computerized presentation slides, built on a user study involving 60 university students from two universities and extensive coding analysis, and we explore the possibility of automatically detecting the information quality of slides. Using the classifications made by human annotators as the gold standard, we compare and evaluate the performance of alternative information quality features and dimensions. The experimental results support the validity of the proposed approach in automatically assessing and classifying the information quality of slides.

Keywords: Information Quality (IQ), Presentation Slides, Classification.

1 Introduction

Computerized presentation slides have become a popular and valuable medium for various occasions such as business meetings, academic lectures, formal presentations, and multi-purpose talks. Online services focused on presentation slides, SlideShare and CourseShare to name a few, offer the ability to search and share computerized presentation slides on the Internet. Millions of slides are available on the Web, and the number is growing continuously. However, most slide service platforms suffer from the problem of discerning the quality of available slides. This problem is acute and getting worse as the number of slides continues to increase.
Further, on most platforms anyone can upload slides. An automated classification approach, if effective, offers several benefits: 1) users can be guided toward high-quality slides among a group of similar slides, and 2) the assessed quality of a slide can be integrated into the search and ranking strategies of slide-specialized search engines. For instance, none of the currently available slide search engines (e.g., SlideShare and CourseShare) supports automated slide categorization or ranking by quality.

1 http://www.slideshare.net
2 http://www.courseshare.org

M. de Rijke et al. (Eds.): ECIR 2014, LNCS 8416, pp. 638-643, 2014. Springer International Publishing Switzerland 2014

When issuing a query to these slide-specialized search engines, end users get only a list of keyword-relevant slides, with no information on slide quality. The automated classification of high-quality slides is therefore an important issue for the advancement of search engines focused on presentation slides.

In the area of information retrieval, measuring the information quality (IQ) of Web documents and of Wikipedia has recently been attempted by several studies. A set of quality features for Web documents for improving retrieval performance [1, 3] has been suggested, and different sets of quality indicators for better ranking and automatic quality classification of Wikipedia articles [2, 4] have also been reported. However, these quality indicators for Web documents and Wikipedia are inappropriate for presentation slides because they overlook the importance of representational aspects. To the best of our knowledge, this study is the first attempt to define quality metrics for the automatic classification of presentation slides. In this study, we consider only lecture slides, as they are the most popular type. The key contributions of this paper are: 1) we investigate presentation slide characteristics and propose a new, comprehensive framework of information quality developed specifically for presentation slides on the basis of a user study, and 2) we assess the validity of the identified quality features and dimensions for the task of automatic quality assessment.

2 Related Work

IQ-related research has recently been receiving considerable attention in the information retrieval community; however, no research has yet considered slide quality. Several studies on the classification and ranking of Web documents [1, 3] and Wikipedia articles [2, 4] have been reported.
For Web documents, Zhou and Croft [3] devised a document quality model using collection-document distance and the information-to-noise ratio to promote high-quality content in the TREC corpus. More recently, Bendersky et al. [1] utilized various quality features related to content, layout, readability, and ease of navigation, including the number of visible terms, the average length of the terms, the fraction of anchor text on the page, and the depth of the URL path. They reported a significant improvement in retrieval performance on the ClueWeb and GOV2 corpora. For Wikipedia, Hu et al. [4] suggested the PEERREVIEW model, which considers review behavior; the PROBREVIEW model was proposed to extend PEERREVIEW with the partial reviewership of contributors. These models determine Wikipedia article quality with features including the number of authors per article, reviewers per article, words per article, and so on. The proposed models were evaluated and found to be effective in article ranking. Dalip et al. [2] tried to classify Wikipedia articles with the following quality indicators: 1) text features (length, structure, style, readability, etc.), 2) review features (review count, reviews per day/user, article age, etc.), and 3) network features (PageRank, in/out degree, etc.). They demonstrated that structure and style features are effective in quality categorization.

We adopt several content features such as entropy and readability [1, 2] from previous studies in our experiment. However, these features are not discriminative enough to determine slide quality, because they overlook useful features concerning the representational aspects of slides. Thus we suggest new features, including font color, font size, and the number of bold words, for representational clarity. To the best of our knowledge, this is the first study of automatic quality classification of slides.

3 Quality Features of Presentation Slides

In this section, we present the quality features developed in order to determine presentation slide quality. Table 1 shows the 28 quality indicators derived for the five IQ dimensions, whose definitions were previously presented in [5, 6].

Table 1. Description of extracted quality features

Informativeness (I)
  numslides       Number of slides
  numterms        Number of terms in the slides [1, 2]
  avgnumterms     Number of terms per slide [1, 2]
  numimgs         Number of images [2]
  avgnumimgs      Number of images per slide [2]
  preexample      Presence of examples
  numexamples     Number of examples
  pretable        Presence of tables [1]
  numtables       Number of tables [1]
  preleaobj       Presence of learning objectives

Cohesiveness (C)
  entropy         Entropy of texts in the slide [1]

Readability (R)
  numstops        Number of stopwords [1]
  fracstops       Stopword / non-stopword ratio [1]
  avgtermlen      Average term length of texts [1]

Ease of Navigation (EN)
  pretablecnts    Presence of a table of contents
  preslidenums    Presence of slide numbers

Representational Clarity (RC)
  numbolds        Number of bold words
  numitalics      Number of italicized words
  numunderlines   Number of underlined words
  numshadows      Number of shadowed words
  sumhighlights   Sum of bolds, italics, underlines, and shadows
  numrichtexts    Number of styled text blocks
  numfontsizes    Number of font sizes
  avgfontsize     Average font size
  numfontnames    Number of font names
  numfontcolors   Number of font colors
  numlinespaces   Number of line spacings
  avglinespace    Average line spacing
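The indicators in Table 1 are extracted from the PowerPoint files themselves. As an illustration only, the following standard-library Python sketch counts a handful of the run-level indicators by reading the DrawingML markup inside a .pptx archive; the paper's actual extractor uses Apache POI in Java, the function names here are our own, and word-level counts (e.g., numbolds) are approximated by counting styled runs.

```python
# Illustrative sketch only: the paper extracts Table 1's indicators with
# Apache POI (Java). Here we read the DrawingML XML inside a .pptx archive
# directly with the Python standard library. Function names are our own,
# and word-level counts (e.g. numbolds) are approximated by styled runs.
import re
import zipfile
import xml.etree.ElementTree as ET

A = "{http://schemas.openxmlformats.org/drawingml/2006/main}"

def slide_indicators(slide_xml):
    """Count a few run-level indicators in a single slide's XML."""
    root = ET.fromstring(slide_xml)
    feats = {"numterms": 0, "numbolds": 0, "numitalics": 0, "fontsizes": set()}
    for run in root.iter(A + "r"):              # one <a:r> per styled text run
        props = run.find(A + "rPr")             # run properties: b, i, sz, ...
        feats["numterms"] += len(run.findtext(A + "t", default="").split())
        if props is not None:
            if props.get("b") == "1":
                feats["numbolds"] += 1
            if props.get("i") == "1":
                feats["numitalics"] += 1
            if props.get("sz"):                 # sz is in hundredths of a point
                feats["fontsizes"].add(int(props.get("sz")) / 100)
    return feats

def pptx_indicators(path):
    """Aggregate the run-level indicators over all slides of a .pptx file."""
    totals = {"numslides": 0, "numterms": 0,
              "numbolds": 0, "numitalics": 0, "fontsizes": set()}
    with zipfile.ZipFile(path) as zf:
        for name in zf.namelist():
            if re.fullmatch(r"ppt/slides/slide\d+\.xml", name):
                totals["numslides"] += 1
                s = slide_indicators(zf.read(name))
                for k in ("numterms", "numbolds", "numitalics"):
                    totals[k] += s[k]
                totals["fontsizes"] |= s["fontsizes"]
    totals["numfontsizes"] = len(totals.pop("fontsizes"))
    return totals
```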

We adopt some quality features from previous studies [1, 2]; the others are inspired by our own user study, which was conducted to determine the quality criteria of presentation slides. Our user study involved 60 students, recruited from two universities in order to balance their backgrounds and individual characteristics, each of whom was asked to view five slides and think aloud while comparing the given slides. The verbal statements were all recorded and transcribed. Through extensive coding analysis, we identified the criteria of IQ and determined their dimensions as shown in Table 1.

The entropy of document D is computed over the individual document terms as in Equation (1):

  Entropy(D) = - sum_{w in D} p(w|D) log p(w|D),  where p(w|D) = tf(w, D) / |D|.   (1)

4 Experiments

We conducted a preliminary automatic classification of high-, fair-, and low-quality lecture slides using the proposed 28 features from the five IQ dimensions. We randomly collected 200 MS PowerPoint presentations from SlideShare for two courses: data mining and computer networks. We manually annotated the 200 slides by quality, hiring six graduates who had completed those courses successfully. Three annotators were assigned per course, so each slide was judged by three annotators. Annotators were instructed to classify a given slide into exactly one of three classes (high, fair, or low), considering all dimensions of quality, such as informativeness, representational clarity, and readability. The inter-annotator agreement among the three annotators was κ = 0.67, which indicates substantial agreement according to Fleiss' kappa [7]. Finally, we obtained 178 slides for which at least two annotators agreed on the quality label (high: 55, fair: 83, low: 40); these 178 slides were used for our classification.
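A minimal sketch of the entropy feature in Equation (1), computed over whitespace-delimited terms. The paper does not state the log base; base 2 is an assumption here.

```python
import math
from collections import Counter

def entropy(text):
    """Entropy of a document's term distribution, as in Equation (1):
    Entropy(D) = -sum_w p(w|D) * log p(w|D), with p(w|D) the relative
    term frequency tf(w, D) / |D|. Log base 2 is an assumption here."""
    terms = text.lower().split()
    if not terms:                 # empty document: define entropy as 0
        return 0.0
    counts = Counter(terms)
    n = len(terms)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

A deck that repeats one term everywhere has entropy 0; the more uniform the term distribution, the higher the value.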
In order to extract the proposed quality features of the slides shown in Table 1, we used Apache POI (http://poi.apache.org/), a Java API for reading and writing Microsoft Office files such as Word, Excel, and PowerPoint. We extracted the features from the textual contents, file metadata, layout, etc., of the PowerPoint files (ppt/pptx). Classification into the three classes (high, fair, and low) was conducted using 10-fold cross-validation. We used SVM and Logistic Regression (LR), which have been widely adopted for classification, as implemented in the Weka toolkit [8]. The default parameter values given in Weka were used in our experiment. We report three micro-averaged measures: precision (P), recall (R), and F1 score (F1). The classification results are shown in Table 2. Performance was measured by adding the features of each dimension in turn. The results clearly reveal the effectiveness of the proposed features and show that there is little difference between SVM and LR.
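The evaluation protocol above can be sketched as follows. The paper used Weka's default SVM and Logistic Regression; scikit-learn stands in here, and the feature matrix is random placeholder data, so the printed scores are meaningless.

```python
# Sketch of the protocol: 10-fold cross-validation of SVM and logistic
# regression, with micro-averaged precision/recall/F1. The paper used Weka
# with default parameters; scikit-learn is a stand-in, and X, y below are
# random placeholders for the 178 slides x 28 indicators.
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_recall_fscore_support

rng = np.random.default_rng(0)
X = rng.random((178, 28))            # placeholder feature matrix
y = rng.integers(0, 3, 178)          # placeholder labels: 0=low, 1=fair, 2=high

for name, clf in [("SVM", SVC()), ("LR", LogisticRegression(max_iter=1000))]:
    pred = cross_val_predict(clf, X, y, cv=10)       # out-of-fold predictions
    p, r, f1, _ = precision_recall_fscore_support(y, pred, average="micro")
    print(f"{name}: P={p:.3f} R={r:.3f} F1={f1:.3f}")
```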

Table 2. Performance of classification by adding individual dimensions

                       SVM                     LR
  Features         P      R      F         P      R      F
  RC             0.402  0.479  0.371     0.509  0.515  0.500
  RC+I           0.547  0.533  0.476     0.551  0.550  0.549
  RC+I+EN        0.602  0.592  0.577     0.592  0.586  0.586
  RC+I+EN+R      0.619  0.604  0.588     0.617  0.615  0.614
  All included   0.622  0.609  0.596     0.600  0.598  0.597

Table 3. Top 11 features by information gain (IG)

  Rank  Feature        IG     Dimension
  1     numtables      0.112  Informativeness
  2     numfontcolors  0.095  Representational Clarity
  3     preslidenums   0.093  Ease of Navigation
  4     numimgs        0.093  Informativeness
  5     numitalics     0.088  Representational Clarity
  6     numslides      0.087  Informativeness
  7     numfontnames   0.077  Representational Clarity
  8     pretable       0.034  Informativeness
  9     pretablecnts   0.033  Ease of Navigation
  10    preexample     0.031  Informativeness
  11    preleaobj      0.003  Informativeness

SVM achieves its best performance, 0.596 in F1, with all features from all dimensions included. With LR, the best performance, 0.614 in F1, is achieved when C (cohesiveness) is excluded. Performance increases as the features of each dimension are added, except for C with LR. To analyze individual feature importance, we computed the information gain (IG) of each feature. The top 11 features by IG are reported in Table 3. The results show that 10 of the proposed features are considerably discriminative for the classification; note the significant drop between the 10th and 11th features. Among the top 10 features, numtables is the most discriminative, followed by numfontcolors and preslidenums. The results imply that the numbers of tables, font colors, images, italicized words, slides, and font faces, together with the presence of slide numbers, tables, a table of contents, and examples, are characteristics by which high-quality slides can be discerned reliably.
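The IG ranking in Table 3 follows the standard definition IG(class; feature) = H(class) - H(class | feature). A minimal sketch for a discrete feature is below; Weka discretizes numeric indicators internally before ranking, so binning is assumed to have been done by the caller.

```python
import math
from collections import Counter

def H(labels):
    """Shannon entropy (base 2) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(feature_values, labels):
    """IG(class; feature) = H(class) - H(class | feature) for a discrete
    feature. Numeric indicators must be binned first (Weka does this
    internally before ranking; the binning scheme is not specified here)."""
    n = len(labels)
    h_cond = 0.0
    for v in set(feature_values):
        # class labels of the instances where the feature takes value v
        subset = [y for x, y in zip(feature_values, labels) if x == v]
        h_cond += len(subset) / n * H(subset)
    return H(labels) - h_cond
```

A feature that perfectly splits the classes yields IG equal to the class entropy; a feature independent of the classes yields IG near 0.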
Remarkably, numfontcolors, preslidenums, numitalics, numfontnames, and pretablecnts are distinctive and discriminative features for slides, even though these categories were not considered for other document types, such as Web pages and Wikipedia, in previous studies [1-4]. Furthermore, these results contrast with previous reports, in which readability features such as stopword fraction and coverage were effective for measuring the quality of Web documents [1], and structure features related to the organization of the article, such as sections, images, citations, and links, were useful for Wikipedia articles [2]. Readability features may not be effective for slides because most slides are written in a condensed form that summarizes the contents.

It is clear that the features about representation newly proposed in this study are effective for slides. These results also support the necessity of developing a different set of features for slides.

5 Concluding Remarks

In this paper, we presented a new, comprehensive framework of information quality developed specifically for computerized presentation slides and reported the performance of automatically detecting the information quality of slides using the features captured by the framework. Although the results are modest, the study supports the validity of the proposed approach in automatically assessing and classifying the information quality of presentation slides. In future work, we intend to develop further salient features from slide layout and content to improve the overall performance, while considering other IQ dimensions such as consistency, completeness, and appropriateness.

Acknowledgements. This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MEST) (No. 2011-0029185). We thank the anonymous reviewers for their helpful comments.

References

1. Bendersky, M., Croft, W.B., Diao, Y.: Quality-biased ranking of web documents. In: Proceedings of the 4th ACM International Conference on Web Search and Data Mining, pp. 95-104. ACM, Hong Kong (2011)
2. Dalip, D.H., Gonçalves, M.A., Cristo, M., Calado, P.: Automatic quality assessment of content created collaboratively by web communities: a case study of Wikipedia. In: Proceedings of the 9th ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 295-304. ACM, Austin (2009)
3. Zhou, Y., Croft, W.B.: Document quality models for web ad hoc retrieval. In: Proceedings of the 14th ACM International Conference on Information and Knowledge Management, pp. 331-332. ACM, Bremen (2005)
4. Hu, M., Lim, E.-P., Sun, A., Lauw, H.W., Vuong, B.-Q.: Measuring article quality in Wikipedia: models and evaluation. In: Proceedings of the 16th ACM International Conference on Information and Knowledge Management, pp. 243-252. ACM, Lisbon (2007)
5. Stvilia, B., Gasser, L., Twidale, M.B., Smith, L.C.: A framework for information quality assessment. J. Am. Soc. Inf. Sci. Tec. 58, 1720-1733 (2007)
6. Alkhattabi, M., Neagu, D., Cullen, A.: Assessing information quality of e-learning systems: a web mining approach. Computers in Human Behavior 27, 862-873 (2011)
7. Fleiss, J.L.: Measuring nominal scale agreement among many raters. Psychological Bulletin 76, 378-382 (1971)
8. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: an update. SIGKDD Explorations Newsletter 11, 10-18 (2009)