Exploring archives with probabilistic models: Topic modelling for the European Commission Archives
Simon Hengchen, Mathias Coeckelbergs, Seth van Hooland, Ruben Verborgh & Thomas Steiner
Université libre de Bruxelles - ReSIC, Ghent University - iminds, Google Germany
{shengche;mcoeckel;svhoolan}@ulb.ac.be; ruben.verborgh@ugent.be; tomayac@google.com
hengchen.net
- Digitisation initiatives for archives have created huge textual corpora
- Those corpora are often of poor quality (OCR) and of unknown content (no metadata)
- As such, they are of little use beyond data preservation
- We postulate it is possible to use LDA and an existing controlled vocabulary to (a) create metadata and (b) retrieve documents
Topic Modelling
Blei, D.M., Ng, A.Y. and Jordan, M.I., 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3, pp. 993-1022.
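To make the idea concrete, here is a minimal sketch of LDA fitted by collapsed Gibbs sampling, written in plain Python. It is an illustration only, not the implementation used for this work: the function names (`lda_gibbs`, `top_tokens`), the hyperparameter values and the toy documents are all assumptions made for the example.

```python
import random
from collections import Counter


def lda_gibbs(docs, k, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA with collapsed Gibbs sampling; return one word Counter per topic.

    docs is a list of token lists; k is the number of topics.
    alpha and beta are symmetric Dirichlet priors (illustrative values).
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * k for _ in docs]          # tokens of doc d in topic t
    topic_word = [Counter() for _ in range(k)]   # count of word w in topic t
    topic_total = [0] * k                        # tokens assigned to topic t
    assignments = []

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(k)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Conditional probability of each topic for this token.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(k)
                ]
                z = rng.choices(range(k), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return topic_word


def top_tokens(topic_word, n=5):
    """Return the n most frequent tokens of each topic (zero counts skipped)."""
    return [
        [w for w, c in counts.most_common() if c > 0][:n]
        for counts in topic_word
    ]


# Toy, hypothetical documents (already tokenised); not taken from the archive.
docs = [
    ["fishing", "fleet", "quota", "fishing", "vessel"],
    ["farm", "crop", "subsidy", "farm", "milk"],
    ["quota", "vessel", "fleet", "fishing"],
    ["milk", "crop", "farm", "subsidy"],
]
topic_word = lda_gibbs(docs, k=2)
topics = top_tokens(topic_word, n=3)
print(topics)
```

The lists printed at the end are the "representative tokens" a human would then read to decide what each topic is about. On a real corpus a library implementation would normally be used instead of a hand-written sampler.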
We postulate it is possible to use LDA and an existing controlled vocabulary to (a) create metadata and (b) retrieve documents. How?
- Use LDA to generate representative tokens
- Deduce, by human interpretation, the topics present in the corpus
- Match the topics with EuroVoc concepts, e.g. http://eurovoc.europa.eu/852
- Manually inspect the documents
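The matching step above is done by people, but a simple helper can shortlist candidate concepts for the annotator. The sketch below ranks vocabulary labels by word overlap with a topic's tokens. It is an assumption-laden illustration: `suggest_concepts` is a hypothetical helper, and the concept identifiers and labels are placeholders, not real EuroVoc entries.

```python
def suggest_concepts(topic_tokens, vocabulary, min_overlap=1):
    """Rank vocabulary concepts by how many label words appear in the topic.

    vocabulary maps a concept identifier to its preferred label.
    Returns (identifier, label) pairs, best overlap first.
    """
    tokens = {t.lower() for t in topic_tokens}
    scored = []
    for concept_id, label in vocabulary.items():
        overlap = len(tokens & set(label.lower().split()))
        if overlap >= min_overlap:
            scored.append((overlap, concept_id, label))
    # Highest overlap first; ties broken by identifier for stable output.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(concept_id, label) for _, concept_id, label in scored]


# Placeholder vocabulary: identifiers and labels are hypothetical.
vocabulary = {
    "concept-A": "agricultural policy",
    "concept-B": "fishing industry",
    "concept-C": "energy policy",
}

clean_topic = ["fishing", "fleet", "industry", "quota"]
noisy_topic = ["tbe", "aud", "ot"]  # the kind of tokens OCR errors produce

print(suggest_concepts(clean_topic, vocabulary))
print(suggest_concepts(noisy_topic, vocabulary))
```

A topic made of OCR debris yields no candidates, which is one way a cluster of salient tokens can end up without a term. The final choice still rests with the annotator.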
Results:
- 100% agreement between non-expert annotators
- All documents matched to a topic are correctly matched
- No specific terms could be attributed to 30% of the clusters of salient tokens
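For readers who want to compute figures of this kind on their own annotations, the sketch below shows one plain way to measure agreement between two annotators and the share of clusters left without a term. The function names and the toy labels are assumptions for illustration; they are not the data behind the results above.

```python
def percent_agreement(labels_a, labels_b):
    """Share of clusters to which both annotators gave the same label."""
    if not labels_a or len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same, non-empty set")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def unlabelled_share(labels):
    """Share of clusters for which no term could be attributed (None)."""
    if not labels:
        raise ValueError("labels must not be empty")
    return sum(label is None for label in labels) / len(labels)


# Toy, hypothetical annotations: one label per cluster, None = no term found.
annotator_1 = ["fisheries", "agriculture", None, "energy"]
annotator_2 = ["fisheries", "agriculture", None, "energy"]

print(percent_agreement(annotator_1, annotator_2))  # 1.0
print(unlabelled_share(annotator_1))                # 0.25
```

Raw percent agreement does not correct for chance; a chance-corrected measure such as Cohen's kappa may be preferable when annotators disagree.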
Discussion and improvement:
- No specific terms could be attributed to 30% of the clusters of salient tokens, because of:
  - OCR noise
  - Too large a k-parameter (number of topics) in LDA
  - Non-expert knowledge of EU-related matters
Future work:
- Experiment with smaller k-parameters
- Expert annotation
- Harvesting the multilingual component
- Implementation
Acknowledgments
Simon Hengchen is supported by Belgian Science Policy (BELSPO) grant no. BR/121/A3/TIC-BELGIUM.