Exploring archives with probabilistic models: Topic modelling for the European Commission Archives


Simon Hengchen, Mathias Coeckelbergs, Seth van Hooland, Ruben Verborgh & Thomas Steiner
Université libre de Bruxelles - ReSIC; Ghent University - iMinds; Google Germany
{shengche;mcoeckel;svhoolan}@ulb.ac.be; ruben.verborgh@ugent.be; tomayac@google.com
hengchen.net

- Digitisation initiatives for archives have created huge textual corpora
- Those corpora are often of poor quality (OCR errors) and of unknown content (no metadata)
- As such, they are of little practical use and serve only for data preservation
- We postulate that LDA and an existing controlled vocabulary can be used to a/ create metadata and b/ retrieve documents

Topic Modelling

Blei, D.M., Ng, A.Y. and Jordan, M.I., 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3.
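To give a rough idea of what LDA produces, the sketch below fits a tiny topic model with collapsed Gibbs sampling, a common inference scheme for LDA that is not necessarily the one used for this work. The function names, hyperparameter values, and toy documents are all assumptions made for illustration; a real experiment would rely on an established library and the actual corpus.

```python
import random
from collections import Counter


def lda_gibbs(docs, k, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA on tokenised documents with collapsed Gibbs sampling.

    Illustrative only: alpha, beta and the iteration count are assumed values.
    Returns per-document topic counts and per-topic word counts.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * k for _ in docs]         # tokens of doc d assigned to topic t
    topic_word = [Counter() for _ in range(k)]  # times word w is assigned to topic t
    topic_total = [0] * k                       # tokens assigned to topic t overall
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            t = rng.randrange(k)
            zs.append(t)
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1
        assignments.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                t = assignments[d][i]
                doc_topic[d][t] -= 1
                topic_word[t][w] -= 1
                topic_total[t] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][w] + beta)
                    / (topic_total[j] + vocab_size * beta)
                    for j in range(k)
                ]
                t = rng.choices(range(k), weights=weights)[0]
                assignments[d][i] = t
                doc_topic[d][t] += 1
                topic_word[t][w] += 1
                topic_total[t] += 1
    return doc_topic, topic_word


def top_tokens(topic_word, n=5):
    """Return up to n most frequent tokens per topic (the 'salient tokens')."""
    return [
        [w for w, c in counts.most_common() if c > 0][:n]
        for counts in topic_word
    ]


# Toy documents, invented for this example (not taken from the archives).
docs = [
    ["fish", "quota", "vessel", "fish"],
    ["milk", "farm", "quota", "milk"],
    ["fish", "vessel", "catch"],
    ["farm", "milk", "subsidy"],
]
doc_topic, topic_word = lda_gibbs(docs, k=2, iterations=50)
salient = top_tokens(topic_word, n=3)
```

The parameter `k` is the number of topics requested; each entry of `salient` is the kind of token cluster that a human reader would then try to interpret.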

We postulate that LDA and an existing controlled vocabulary can be used to a/ create metadata and b/ retrieve documents. How?
- Use LDA to generate representative tokens
- Deduce the topics present in the corpus by human interpretation of those tokens
- Match the topics with EuroVoc
- Manually inspect the documents
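The matching step above was carried out by human annotators. As a hedged illustration of how it might be assisted, the sketch below ranks controlled-vocabulary terms by token overlap with each LDA token cluster. The function `match_topics`, the toy vocabulary, and the keyword lists are hypothetical; they are not the actual EuroVoc thesaurus or the procedure used in this work.

```python
def match_topics(topics, vocabulary, min_overlap=1):
    """Suggest vocabulary terms for each topic, ranked by shared tokens.

    topics:     {topic_id: [salient tokens]}
    vocabulary: {term label: [keywords associated with the term]}
    Both structures are assumptions made for this example.
    """
    suggestions = {}
    for topic_id, tokens in topics.items():
        token_set = {t.lower() for t in tokens}
        scored = []
        for term, keywords in vocabulary.items():
            overlap = token_set & {k.lower() for k in keywords}
            if len(overlap) >= min_overlap:
                scored.append((len(overlap), term))
        # Highest overlap first; ties broken alphabetically for stable output.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        suggestions[topic_id] = [term for _, term in scored]
    return suggestions


# Invented token clusters: one interpretable, one dominated by OCR noise.
topics = {
    0: ["fish", "fishing", "quota", "vessel", "catch"],
    1: ["xq", "tlie", "ot", "ii", "tbe"],
}
# Toy stand-in for a controlled vocabulary (not real EuroVoc data).
vocabulary = {
    "fisheries policy": ["fish", "fishing", "fisheries", "quota"],
    "agricultural policy": ["farm", "agricultural", "milk", "quota"],
}
suggestions = match_topics(topics, vocabulary)
```

Under these assumptions, cluster 0 receives two candidate terms with the fisheries one ranked first, while the noisy cluster 1 receives none. Any such suggestion would still need the manual inspection listed as the final step.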

Results:
- 100% agreement between non-expert annotators
- All documents matched to a topic are correctly matched
- No specific terms could be attributed to 30% of the clusters of salient tokens
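Figures of this kind can be computed as simple proportions. The sketch below shows one possible way to do so; the helper names and the label lists are invented for illustration and are not the study's annotation data.

```python
def percent_agreement(labels_a, labels_b):
    """Percentage of items on which two annotators chose the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return 100.0 * matches / len(labels_a)


def unlabelled_share(labels):
    """Percentage of clusters for which no term could be attributed (None)."""
    if not labels:
        raise ValueError("no clusters given")
    return 100.0 * sum(label is None for label in labels) / len(labels)


# Made-up labels for ten token clusters; None means "no term attributed".
annotator_1 = ["fisheries", "agriculture", None, "energy", "trade",
               None, "transport", "fisheries", None, "energy"]
annotator_2 = list(annotator_1)  # identical choices, i.e. full agreement

agreement = percent_agreement(annotator_1, annotator_2)
unlabelled = unlabelled_share(annotator_1)
```

Raw percentage agreement does not correct for agreement expected by chance; a chance-corrected measure such as Cohen's kappa is a common alternative when annotators disagree on some items.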

Discussion and improvement:
- No specific terms could be attributed to 30% of the clusters of salient tokens, because of:
  - OCR noise
  - A k parameter (number of topics) in LDA that was too large
  - Non-expert knowledge of EU-related matters

Future work:
- Experiment with smaller k-parameters
- Expert annotation
- Harvesting the multilingual component
- Implementation

Acknowledgments

Simon Hengchen is supported by Belgian Science Policy (BELSPO) grant no. BR/121/A3/TIC-BELGIUM.
