JOHN SHEPHERDSON... AUTOMATIC KEYWORD GENERATION TECHNICAL SERVICES DIRECTOR UK DATA ARCHIVE UNIVERSITY OF ESSEX...


1 AUTOMATIC KEYWORD GENERATION. JOHN SHEPHERDSON, TECHNICAL SERVICES DIRECTOR, UNIVERSITY OF ESSEX. DEVCON1, UNIVERSITY OF ESSEX, 12 APRIL 2013

2 Abstract: HASSET is a subject thesaurus that has been developed by the UK Data Archive over more than 20 years. It provides a standard set of keywords that can be used to tag items whose content relates to the fields of Humanities and Social Science. In the past the tagging has been carried out by human experts, but we have recently developed a prototype system that can generate a list of keywords automatically, given an arbitrary piece of text. Learn about our experiences of generating keywords automatically, and the evaluation of the results. See a demo of automated keyword generation. Although we have used HASSET (Humanities and Social Science Electronic Thesaurus) keywords, the approach holds for other keyword sets.

3 Overview: brief background: the UK Data Archive and the UK Data Service; cataloguing practices and thesaurus tools; SKOS-HASSET: RDF version of thesaurus; auto generation of keywords: technologies used, evaluation, findings; application to web page metadata; acknowledgements and questions

4 The UK Data Archive and the UK Data Service: based at the University of Essex since 1967; curator of the largest collection of digital data in the social sciences and humanities in the UK (see data-archive.ac.uk for more details); makes these available via the new UK Data Service; the UK Data Service also provides value-added services for UK Census data, government surveys and beyond; the UK Data Service includes the Universities of Essex, Manchester (Mimas, CCSR), Leeds, Southampton, Edinburgh (Edina) and University College London (see ukdataservice.ac.uk for more details)

5 The UK Data Service: cataloguing standards. The UK Data Service indexes over 5000 digital data collections, and the number is ever growing; all are catalogued at thematic level; many are also indexed at variable level; available via Discover (ukdataservice.ac.uk); every data collection is indexed with HASSET terms; HASSET is employed in the search

6 HASSET: multidisciplinary thesaurus; developed originally to support the UK Data Archive/UK Data Service collections; coverage in the core subject areas of social science; uses standard hierarchical relationships: TT (top term), BT (broader term), NT (narrower term), RT (related term), USE (from non-preferred term to preferred term), UF (from preferred term to non-preferred term). The role of HASSET in the Archive is twofold: it is used internally for indexing studies and series with HASSET terms, and it is also a separate product licensed to others

7 ELSST: the European Language Social Science Thesaurus (ELSST) is a multilingual thesaurus, based on core English terms taken from HASSET; translated into 11 languages (with more on the way); closely connected with HASSET, but must demonstrate international applicability of all terms

8 Applying SKOS to HASSET: SKOS/RDF. What is RDF? The Resource Description Framework (RDF) describes data using a simple format: subject, predicate, object, e.g. car hasColour red. So, what is SKOS? The Simple Knowledge Organization System (SKOS) is a set of RDF predicates for describing relationships between thesaurus terms, e.g. skos:concept162 skos:prefLabel CAR and skos:concept162 skos:altLabel AUTOMOBILE. It encodes thesaurus products in a standardised way to make their structures comparable and to facilitate interaction
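The subject/predicate/object idea on this slide can be sketched in a few lines of plain Python. This is only an illustration: the concept URI is made up, the triples are held in a list rather than a real triple store, and `preferred_label` is a hypothetical helper, not part of any SKOS library. Only the SKOS namespace and the `prefLabel`/`altLabel` property names are taken from the SKOS standard.

```python
# SKOS core namespace (from the W3C standard).
SKOS = "http://www.w3.org/2004/02/skos/core#"
# Made-up concept URI, for illustration only.
CONCEPT = "http://example.org/hasset/concept162"

# Each triple is (subject, predicate, object).
triples = [
    (CONCEPT, SKOS + "prefLabel", "CAR"),
    (CONCEPT, SKOS + "altLabel", "AUTOMOBILE"),
]

def preferred_label(label, triples):
    """Return the preferred label of the concept that carries `label`."""
    for subject, _predicate, obj in triples:
        if obj == label:
            # Found the concept; now look up its prefLabel.
            for s, p, o in triples:
                if s == subject and p == SKOS + "prefLabel":
                    return o
    return None

print(preferred_label("AUTOMOBILE", triples))
```

Looking up the non-preferred label AUTOMOBILE leads back to the preferred term CAR, which mirrors the USE/UF relationship in the thesaurus.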

9 Applying SKOS to HASSET (2): SKOS has been applied to HASSET; persistence via GUIDs; version control: all terms date stamped, all changes recorded; live versions of thesaurus products (SKOS-HASSET, ELSST) made at agreed, regular intervals, with recognised annual major incremental versioning; we are using Pubby to publish our SKOS: it provides a Linked Data interface to RDF data held in a BrightstarDB triple store

10 Automated indexing: four corpora: Nesstar questions/variables (humanly indexed during project), 17.5k; questionnaires, 2.5k; catalogue records, 5.5k; publications (case studies and support/how-to guides), 0.25k

11 Automated indexing: the task. Automatically index the four corpora using HASSET terms; evaluate the results; present as a case study (via the SKOS-HASSET blog). Pre-processing tasks: conversion of PDFs to text; extraction of metadata (manual keywords), some of which were embedded within PDFs and others held externally in databases; extraction of the data into two file types: .txt (actual text) and .key (gold-standard keywords)

12 Automated indexing: experimental work. Three methodologies used: Term Frequency/Inverse Document Frequency (TF/IDF) model; Keyphrase Extraction Algorithm (KEA); Solr search

13 TF/IDF model: our text sample was small, so we first considered this model, which requires no training data; processed 2.5k SQB documents; no controlled vocabulary. Results: the keywords returned carried little domain-specific information, although ours is a domain-specific collection; mapping extracted keywords to HASSET returned few matches, e.g. it failed to find matches for the keyword Liberal Party, which exists in HASSET only in different forms: BRITISH LIBERAL PARTY and LIBERAL PARTY (GREAT BRITAIN)
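As a rough illustration of the TF/IDF idea (not the project's actual implementation, which the slides do not show), the sketch below scores each word by its frequency in a document, weighted down by how many documents contain it. The tokenisation (lower-casing and splitting on whitespace), the exact weighting formula, and the sample questions are all assumptions.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: score} dict per document.

    tf  = term count / number of tokens in the document
    idf = log(N / number of documents containing the term)
    """
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    scores = []
    for tokens in tokenised:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

# Made-up sample questions, for illustration only.
docs = [
    "which party did you vote for liberal party",
    "how often do you vote",
    "do you own a car",
]
scores = tf_idf(docs)
best = max(scores[0], key=scores[0].get)
print(best, round(scores[0][best], 3))
```

Because no controlled vocabulary is involved, the highest-scoring words are free text and still have to be mapped to thesaurus terms afterwards, which is where the Liberal Party mismatch described above arises.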

14 Keyphrase Extraction Algorithm (KEA): keyword indexing using a controlled vocabulary; uses training data (based on keyword coverage); builds a classifier training model (WEKA). The algorithm is based on machine learning and works in two main steps: candidate term identification, which identifies phrases (n-grams) from the text and maps these to HASSET; and filtering, which uses a learned model (from our training data) to identify the most significant keywords based on features
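The first of the two steps, candidate term identification, can be sketched as follows. This is a simplified stand-in rather than KEA's own code: it skips stemming and stop-word handling, the three-term vocabulary is a made-up subset, and the function names are hypothetical. The second step (filtering with a learned model) is not modelled here.

```python
def ngrams(tokens, max_len=3):
    """Yield every run of 1..max_len consecutive tokens as a phrase."""
    for size in range(1, max_len + 1):
        for start in range(len(tokens) - size + 1):
            yield " ".join(tokens[start:start + size])

def candidate_terms(text, vocabulary, max_len=3):
    """Return the vocabulary terms whose wording appears as an n-gram in text."""
    tokens = text.lower().split()
    # Case-insensitive lookup from phrase to the thesaurus form of the term.
    lookup = {term.lower(): term for term in vocabulary}
    found = {lookup[p] for p in ngrams(tokens, max_len) if p in lookup}
    return sorted(found)

# Tiny made-up subset of a controlled vocabulary.
vocabulary = ["MARKET RESEARCH", "RESEARCH", "SURVEYS"]
print(candidate_terms("market research surveys of households", vocabulary))
```

Restricting candidates to n-grams that map onto the controlled vocabulary is what keeps the output within the thesaurus, in contrast to the free-text output of the TF/IDF model.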

15 Keyphrase Extraction Algorithm (KEA) (2): created a training model using human indexers' keywords; 80% of text used for training; 20% of text used for testing; uses SKOS-HASSET as controlled vocabulary; used a stop-word list and trained KEA to avoid method terms ("do you think", "closest to your view")

17 KEA automated indexing: step by step. Wrapped the KEA JAR file with a client.
Training mode:
    java -jar kea.jar -m <output:model_location> -t <input:data_location>
Generation mode:
    java -jar kea.jar -d <input:data_location> -m <input:model_location> -n <output:max_no_keywords>
Can also set: thesaurus/controlled vocabulary file and format; stemmer type; document language and encoding; stopwords file; minimum and maximum number of words in a phrase; minimum occurrence of a phrase in a document

18 KEA automated indexing: step by step (2). Evaluation: used PowerShell to generate a spreadsheet for each document in the corpus, comparing manual and automatic keywords; used PowerShell to generate one summary spreadsheet for each corpus: F1, recall, precision.
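The three summary measures can be computed per document as in the sketch below. It assumes exact-match comparison between the manual (gold-standard) and automatic keyword sets; the original evaluation used PowerShell and its matching rules are not given in the slides, so the function and the sample keywords are illustrative only.

```python
def evaluate(manual, automatic):
    """Return (precision, recall, F1) for exact keyword matches."""
    manual, automatic = set(manual), set(automatic)
    matches = len(manual & automatic)
    # Precision: share of automatic keywords that the indexer also chose.
    precision = matches / len(automatic) if automatic else 0.0
    # Recall: share of the indexer's keywords that were found automatically.
    recall = matches / len(manual) if manual else 0.0
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if matches else 0.0
    return precision, recall, f1

# Made-up keyword sets for one document.
manual = ["SURVEYS", "ECONOMICS", "HOUSING", "INCOME"]
automatic = ["SURVEYS", "ECONOMICS", "DATA"]
precision, recall, f1 = evaluate(manual, automatic)
print(round(precision, 2), round(recall, 2), round(f1, 2))
```

Averaging these per-document values over a corpus would give one summary row per corpus, in the spirit of the summary spreadsheets described above.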

19 KEA automated indexing: results (recall and precision). Broad recall scores: case studies/support guides 0.73; SQB 0.5; Nesstar 0.36; catalogue records 0.2 (low). This suggests that KEA could be usefully employed to suggest new relevant terms for full-text corpora. Broad precision scores: SQB 0.47; catalogue records 0.42; Nesstar 0.41; case studies/support guides 0.25. Overall, this suggests that KEA keywords are very often relevant

20 KEA automated indexing: results (more) Little overlap between KEA keywords and manual keywords (on average KEA found keywords per document across the four corpora, of which only 2.33 were exact matches with the manual keywords) However, a high percentage of KEA keywords were considered relevant/suitable even if they were not exact matches: 33% for the SQB corpus with an average of 25% across all four corpora KEA could be a very useful tool for indexers

21 Solr indexing and search: runs against SKOS-HASSET; searches every word; finds phrases as well as single words; uses stop words; returns the preferred term if a synonym is found; uses a non-aggressive stemming approach; demo

22 Solr indexing and search (2): uses an inverted search index; SKOS-HASSET RDF is used to create the Solr core; the text entered is used to search the core one word at a time; phrase matching is achieved by de-inverting the search; this approach is used because the text input (1000 chars) is much smaller than the thesaurus (7,0000 words); multilingualism can be achieved by translating SKOS-HASSET and having a core per input language
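The word-at-a-time lookup against an index of thesaurus labels can be illustrated with a toy, pure-Python stand-in. This is not Solr and does not reproduce the project's configuration: the label table, the stop-word list, and the way phrases are confirmed (checking that the whole label occurs in the input) are assumptions made for the sketch, and stemming is left out.

```python
from collections import defaultdict

# Made-up thesaurus fragment: label -> preferred term.
LABELS = {
    "car": "CAR",
    "automobile": "CAR",  # synonym mapped to its preferred term
    "market research": "MARKET RESEARCH",
    "research": "RESEARCH",
}
STOP_WORDS = {"the", "of", "a", "and"}

def build_index(labels):
    """Map each word to the thesaurus labels that contain it."""
    index = defaultdict(set)
    for label in labels:
        for word in label.split():
            index[word].add(label)
    return index

def suggest(text, labels=LABELS):
    """Return preferred terms whose labels occur in the input text."""
    index = build_index(labels)
    # Note: dropping stop words before the phrase check is a simplification.
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    joined = " " + " ".join(words) + " "
    found = set()
    for word in words:  # search the index one input word at a time
        for label in index.get(word, ()):
            if " " + label + " " in joined:  # confirm the full phrase occurs
                found.add(labels[label])
    return sorted(found)

print(suggest("market research on the automobile"))
```

The input text drives the search because it is far shorter than the thesaurus, so only labels sharing at least one word with the input are ever examined.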

23 Automated indexing: findings. KEA: training needs a large corpus and takes tens of minutes; generation is too slow to run in real time. Solr: no training; size of corpus largely immaterial; very fast, so can be used in real time; no learning, so cannot suggest new terms

24 Automated indexing: findings (2). Both: can easily be used with a different thesaurus; stop words can easily be extended. BUT more work is needed to investigate further and to see how they could be incorporated, technically and in terms of process, into our systems

25 Crude comparison: Solr and KEA. Used the same 10 abstracts from catalogue records as input. KEA found 99 unique HASSET terms; Solr found 231 unique HASSET terms. So Solr is better than KEA? Not so fast: KEA found 24 terms not found by Solr, and 3 phrases only partially found by Solr, e.g. INFORMATION/LIBRARY SYSTEMS AND SERVICES vs INFORMATION

26 KEA tag cloud

27 Solr tag cloud

28 Application to web page metadata tags. Theoretical approach: for each page, identify content section(s); feed content into the keyword generator; optionally review suggested keywords; insert keywords into metadata tags

29 Application to web page metadata tags. Example: the UK Data Service "about our data" page. Generated keywords (Solr): ARCHIVES, CATALOGUES, CENTRAL GOVERNMENT, DATA, DIARIES, ECONOMIC INDICATORS, ECONOMICS, FIELDS, GOVERNMENT, GRANTS, IMAGE, MARKET RESEARCH, MATERIALS, PAPER, PHOTOGRAPHS, REPORTS, RESEARCH, RESEARCH GRANTS, SURVEYS, TEACHING
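The last step of the theoretical approach, inserting the generated keywords into a page's metadata, might look like the sketch below. The two helper functions are hypothetical, the page is a minimal made-up document, and the keyword list is a subset of the generated keywords listed above; identifying content sections and calling the keyword generator are outside the sketch.

```python
from html import escape

def keywords_meta_tag(keywords):
    """Build a <meta name="keywords"> tag from a list of keywords."""
    content = ", ".join(keywords)
    return '<meta name="keywords" content="{}">'.format(escape(content, quote=True))

def insert_into_head(page_html, tag):
    """Insert tag just before the first </head>; leave the page unchanged if absent."""
    marker = "</head>"
    position = page_html.find(marker)
    if position == -1:
        return page_html
    return page_html[:position] + tag + "\n" + page_html[position:]

# Minimal made-up page and a subset of the generated keywords.
page = "<html><head><title>About our data</title></head><body></body></html>"
tag = keywords_meta_tag(["ARCHIVES", "CATALOGUES", "SURVEYS"])
print(insert_into_head(page, tag))
```

Keeping the optional review step between generation and insertion means an editor can still drop unsuitable suggestions before they reach the page.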

30 Want to find out more? "Using Solr search in a .NET environment", Matthew Brumpton, Breakout 4 (13:30-14:30) SKOS-HASSET browser: HASSET browser: SKOS-HASSET Project web site: SKOS-HASSET blog:

31 Acknowledgements: the automatic keyword generation work described here was undertaken as part of the JISC-funded SKOS-HASSET project. Project Manager: Lucy Bell; Evaluators: Lorna Balkan, Suzanne Barbalet; KEA programming: Mahmoud El-Haj; SKOS/RDF programming: Darren Bell; Solr programming: Oscar Dovao

32 Questions?

33 CONTACT: UNIVERSITY OF ESSEX, WIVENHOE PARK, COLCHESTER, ESSEX CO4 3SQ. T +44 (0) E info@data-archive.ac.uk data-archive.ac.uk
