Exploring archives with probabilistic models: Topic modelling for the European Commission Archives

Simon Hengchen, Mathias Coeckelbergs, Seth van Hooland, Ruben Verborgh & Thomas Steiner
Université libre de Bruxelles - ReSIC; Ghent University - iMinds; Google Germany
{shengche;mcoeckel;svhoolan}@ulb.ac.be; ruben.verborgh@ugent.be; tomayac@google.com
hengchen.net

- Digitisation initiatives for archives have created huge textual corpora
- Those corpora are often of poor quality (OCR errors) and of unknown content (no metadata)
- As such, they are of little use and serve only for data preservation
- We postulate that it is possible to use LDA and an existing controlled vocabulary to a/ create metadata and b/ retrieve documents

Topic Modelling
Blei, D.M., Ng, A.Y. and Jordan, M.I., 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3, pp. 993-1022.

We postulate that it is possible to use LDA and an existing controlled vocabulary to a/ create metadata and b/ retrieve documents. How?
- Use LDA to generate representative tokens
- Intellectually deduce the topics present in the corpus
- Match the topics with EuroVoc (example concept: http://eurovoc.europa.eu/852)
- Manually inspect the documents
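The first step above, using LDA to generate representative tokens, can be sketched as follows. This is a minimal illustration only: the slides do not say which LDA implementation or inference method was used, so the sketch assumes a small collapsed Gibbs sampler written with the Python standard library. The function name `lda_top_tokens`, the hyperparameter values, and the toy documents are all hypothetical and are not taken from the European Commission corpus.

```python
import random


def lda_top_tokens(docs, k, n_top=5, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; return up to n_top tokens per topic."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)
    word_id = {w: i for i, w in enumerate(vocab)}
    corpus = [[word_id[w] for w in doc] for doc in docs]

    doc_topic = [[0] * k for _ in corpus]            # tokens of doc d in topic t
    topic_word = [[0] * vocab_size for _ in range(k)]  # count of word w in topic t
    topic_total = [0] * k                            # tokens assigned to topic t
    assign = []                                      # topic of every token

    # Random initial topic assignment for every token.
    for d, doc in enumerate(corpus):
        zs = []
        for w in doc:
            t = rng.randrange(k)
            zs.append(t)
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1
        assign.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(corpus):
            for n, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                t = assign[d][n]
                doc_topic[d][t] -= 1
                topic_word[t][w] -= 1
                topic_total[t] -= 1
                # Resample its topic from the collapsed conditional distribution.
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][w] + beta)
                    / (topic_total[j] + vocab_size * beta)
                    for j in range(k)
                ]
                t = rng.choices(range(k), weights=weights)[0]
                assign[d][n] = t
                doc_topic[d][t] += 1
                topic_word[t][w] += 1
                topic_total[t] += 1

    # Rank words per topic by assignment count; skip words never assigned.
    topics = []
    for t in range(k):
        ranked = sorted(range(vocab_size), key=lambda w: topic_word[t][w], reverse=True)
        topics.append([vocab[w] for w in ranked[:n_top] if topic_word[t][w] > 0])
    return topics


# Toy documents (invented for illustration), already tokenised.
toy_docs = [
    "fishing quota fleet fishing vessel".split(),
    "vessel fleet quota catch fishing".split(),
    "tariff customs import tariff trade".split(),
    "trade import customs duty tariff".split(),
]
topics = lda_top_tokens(toy_docs, k=2)
for i, tokens in enumerate(topics):
    print(f"topic {i}: {' '.join(tokens)}")
```

Each printed list plays the role of a cluster of salient tokens. Deducing what a cluster is about and matching it to a EuroVoc concept remain manual steps, as in the procedure above. On real OCR output one would also expect noisy tokens to appear in these lists.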

Results:
- 100% agreement between non-expert annotators
- All documents matched to a topic are correctly matched
- No specific terms could be attributed to 30% of the clusters of salient tokens

Discussion and improvement:
- No specific terms could be attributed to 30% of the clusters of salient tokens, because of:
  - OCR noise
  - Too large a k-parameter (number of topics) in LDA
  - Non-expert knowledge of EU-related matters

Future work:
- Experiment with smaller k-parameters
- Expert annotation
- Harvesting the multilingual component
- Implementation

Acknowledgments
Simon Hengchen is supported by Belgian Science Policy (BELSPO) grant no. BR/121/A3/TIC-BELGIUM.