Exploring archives with probabilistic models: Topic modelling for the European Commission Archives

1 Exploring archives with probabilistic models: Topic modelling for the European Commission Archives. Simon Hengchen, Mathias Coeckelbergs, Seth van Hooland (Université libre de Bruxelles - ReSIC), Ruben Verborgh (Ghent University - iMinds) & Thomas Steiner (Google Germany). {shengche;mcoeckel;svhoolan}@ulb.ac.be; ruben.verborgh@ugent.be; tomayac@google.com. hengchen.net

5 - Digitisation initiatives for archives have created huge textual corpora - Those corpora are often of bad quality (OCR), and of unknown content (no metadata) - As such, they are useless and only serve for data preservation - We postulate it is possible to use LDA and an existing controlled vocabulary to a/ create metadata and b/ retrieve documents

6 Topic Modelling: Blei, D.M., Ng, A.Y. and Jordan, M.I., Latent Dirichlet allocation. Journal of Machine Learning Research, 3, pp

11 We postulate it is possible to use LDA and an existing controlled vocabulary to a/ create metadata and b/ retrieve documents. How? - Use LDA to generate representative tokens - Intellectually deduce the topics present in the corpus - Match the topics with EuroVoc - Manually inspect the documents
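The first and third steps above can be sketched in code. This is a minimal illustration, not the authors' pipeline: the slides do not say which LDA implementation or inference method was used, so the example swaps in a toy collapsed Gibbs sampler written in plain Python. The corpus, the hyperparameters (`alpha`, `beta`, `iterations`), the function names and the two vocabulary labels are all assumptions made for illustration; the labels are not verified EuroVoc descriptors, and the word-overlap lookup is only a naive stand-in for the manual matching described on the slide.

```python
import random
from collections import Counter


def lda_gibbs(docs, k, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA; returns one word Counter per topic."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * k for _ in docs]          # tokens per topic in each document
    topic_word = [Counter() for _ in range(k)]   # word counts per topic
    topic_total = [0] * k                        # total tokens per topic
    assignments = []
    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(k)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample its topic from the collapsed conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(k)
                ]
                z = rng.choices(range(k), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return topic_word


def top_tokens(topic_word, n=5):
    """Most frequent tokens of each topic (the 'representative tokens')."""
    return [[w for w, c in tw.most_common(n) if c > 0] for tw in topic_word]


def match_labels(tokens, vocabulary):
    """Naive lookup: vocabulary labels sharing at least one word with the tokens."""
    token_set = set(tokens)
    return [label for label in vocabulary if token_set & set(label.lower().split())]


# Hypothetical four-document corpus, already tokenised and lower-cased.
docs = [
    "agricultural policy farm subsidy milk".split(),
    "farm milk price agricultural market".split(),
    "fishing fleet quota fisheries vessel".split(),
    "fisheries quota fishing catch vessel".split(),
]
# Illustrative labels only, not checked against EuroVoc.
vocabulary = ["agricultural policy", "fisheries policy"]

topic_word = lda_gibbs(docs, k=2)
topics = top_tokens(topic_word)
for tokens in topics:
    print(tokens, "->", match_labels(tokens, vocabulary))
```

On real OCR output the token lists would be much noisier than in this toy corpus, which is one reason the slides keep a human in the loop for deducing and matching topics.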

14 Results: - 100% agreement between non-expert annotators - All documents matched to a topic are correctly matched - No specific terms could be attributed to 30% of the clusters of salient tokens
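How such figures are computed can be shown with a short sketch. The slides do not state which agreement measure was used, so simple percent agreement is assumed here. The annotations below are invented toy data, chosen only so that the output mirrors the percentages on the slide; they are not the study's data, and the function names are hypothetical.

```python
def percent_agreement(labels_a, labels_b):
    """Percentage of items to which two annotators gave the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    same = sum(a == b for a, b in zip(labels_a, labels_b))
    return 100.0 * same / len(labels_a)


def unlabelled_share(labels):
    """Percentage of clusters for which no vocabulary term could be attributed."""
    if not labels:
        raise ValueError("need at least one cluster")
    return 100.0 * sum(label is None for label in labels) / len(labels)


# Invented annotations for ten token clusters; None means "no term attributed".
annotator_a = ["agriculture", "fisheries", None, "energy", "trade",
               None, "transport", "agriculture", None, "trade"]
annotator_b = list(annotator_a)  # the second annotator agrees on every cluster

print(percent_agreement(annotator_a, annotator_b))  # 100.0
print(unlabelled_share(annotator_a))                # 30.0
```

Percent agreement does not correct for agreement expected by chance; a chance-corrected measure such as Cohen's kappa would be a natural alternative when annotators disagree.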

15 Discussion and improvement: - No specific terms could be attributed to 30% of the clusters of salient tokens, because of: - OCR noise - Too large a k parameter (number of topics) in LDA - Non-expert knowledge of EU-related matters

16 Future work: - Experiment with smaller k parameters - Expert annotation - Harvesting the multilingual component - Implementation

17 Acknowledgments: Simon Hengchen is supported by Belgian Science Policy (BELSPO) grant no. BR/121/A3/TIC-BELGIUM.


Large Scale Behavioral Analytics via Topical Interaction

Large Scale Behavioral Analytics via Topical Interaction Large Scale Behavioral Analytics via Topical Interaction Shih-Chieh Su Information Security and Risk Management Department Qualcomm Inc. San Diego, CA, 92121 shihchie@qualcomm.com arxiv:1608.07625v1 [cs.lg]

More information

Spatial Latent Dirichlet Allocation

Spatial Latent Dirichlet Allocation Spatial Latent Dirichlet Allocation Xiaogang Wang and Eric Grimson Computer Science and Computer Science and Artificial Intelligence Lab Massachusetts Tnstitute of Technology, Cambridge, MA, 02139, USA

More information

Big Data and Large Scale Machine Learning

Big Data and Large Scale Machine Learning CSE740: Project Ideas 12 Sept 2016 CSE740 Projects Mandatory for students enrolled for 2 or 3 credits To be done in groups of 3 Milestones: 1 Send in an email to instructors with

More information

Automatic Triage of Mental Health Forum Posts

Automatic Triage of Mental Health Forum Posts Automatic Triage of Mental Health Forum Posts Benjamin Shickel and Parisa Rashidi University of Florida Gainesville, FL {shickelb, parisa.rashidi}@ufl.edu Abstract As part of the 2016 Computational Linguistics

More information

Language Resources. Khalid Choukri ELRA/ELDA 55 Rue Brillat-Savarin, F Paris, France Tel Fax.

Language Resources. Khalid Choukri ELRA/ELDA 55 Rue Brillat-Savarin, F Paris, France Tel Fax. Language Resources By the Other Data Center over 15 years fruitful partnership Khalid Choukri ELRA/ELDA 55 Rue Brillat-Savarin, F-75013 Paris, France Tel. +33 1 43 13 33 33 -- Fax. +33 1 43 13 33 30 choukri@elda.org

More information

Visual Object Recognition

Visual Object Recognition Perceptual and Sensory Augmented Computing Visual Object Recognition Tutorial Visual Object Recognition Bastian Leibe Computer Vision Laboratory ETH Zurich Chicago, 14.07.2008 & Kristen Grauman Department

More information

Digitisation of historic newspapers and voluntary digital deposit of newspaper pre-print files in the the National Library of Estonia

Digitisation of historic newspapers and voluntary digital deposit of newspaper pre-print files in the the National Library of Estonia Digitisation of historic newspapers and voluntary digital deposit of newspaper pre-print files in the the National Library of Estonia Krista Kiisa Digitisation Coordinator Activities to be presented Digitisation

More information

Overcoming the Memory Bottleneck in Distributed Training of Latent Variable Models of Text

Overcoming the Memory Bottleneck in Distributed Training of Latent Variable Models of Text Overcoming the Memory Bottleneck in Distributed Training of Latent Variable Models of Text Yi Yang Northwestern University Evanston, IL yiyang@eecs.northwestern.edu Alexander Yates Temple University Philadelphia,

More information

Ranking models in Information Retrieval: A Survey

Ranking models in Information Retrieval: A Survey Ranking models in Information Retrieval: A Survey R.Suganya Devi Research Scholar Department of Computer Science and Engineering College of Engineering, Guindy, Chennai, Tamilnadu, India Dr D Manjula Professor

More information

A Robust Number Parser based on Conditional Random Fields

A Robust Number Parser based on Conditional Random Fields A Robust Number Parser based on Conditional Random Fields Heiko Paulheim Data and Web Science Group, University of Mannheim, Germany Abstract. When processing information from unstructured sources, numbers

More information

The DIGMAP Virtual Digital Library

The DIGMAP Virtual Digital Library José Borbinha *, Gilberto Pedrosa *, João Luzio *, Hugo Manguinhas *, Bruno Martins * The DIGMAP Virtual Digital Library Keywords: Geographic information; cartographic heritage; information systems architectures;

More information

Deduced Social Networks for Educational Portal

Deduced Social Networks for Educational Portal Deduced Social Networks for Educational Portal Monika Akbar Dept. of Computer Science Virginia Tech, Blacksburg, VA amonika@vt.edu Clifford A. Shaffer Dept. of Computer Science Virginia Tech, Blacksburg,

More information

How can CLARIN archive and curate my resources?

How can CLARIN archive and curate my resources? How can CLARIN archive and curate my resources? Christoph Draxler draxler@phonetik.uni-muenchen.de Outline! Relevant resources CLARIN infrastructure European Research Infrastructure Consortium National

More information

Conference of Directors of National Libraries in Asia and Oceania. Hanoi, 20 April 2009

Conference of Directors of National Libraries in Asia and Oceania. Hanoi, 20 April 2009 Conference of Directors of National Libraries in Asia and Oceania Hanoi, 20 April 2009 Use of Open Source Software at the National Library of Australia Jan Fullerton Director-General National Library of

More information

Some challenges ahead for the Open Language Archives Community

Some challenges ahead for the Open Language Archives Community Some challenges ahead for the Open Language Archives Community Gary F. Simons SIL International Co-coordinator with Steven Bird, Open Language Archives Community Workshop on Language Archives in the Americas

More information