Similar documents
IRUS-UK: Improving understanding of the value and impact of institutional repositories

Audio set-up. » Click on the profile icon at the bottom of. Built-in speakers and microphones create echoes, feedback & spoil the session for others

Making ETDs count in UK repositories. Paul Needham, Cranfield University. ETD2014, 24th July 2014

The COUNTER Code of Practice for Articles

consistency, clarity, simplification and continuous maintenance Release 5

Open Research Online: The Open University's repository of research publications and other research outputs

Infrastructure for the UK

Enhancing E-Journal Access In A Digital Work Environment

Institutional Repository Usage Statistics

The Topic Specific Search Engine

Approaching Librarianship from the Data: Using Bibliomining for Evidence-Based Librarianship

INSTITUTIONAL REPOSITORY SERVICES

Putting Open Access into Practice

Adaptive Lagrange Multiplier for Low Bit Rates in H.264

Increasing access to OA material through metadata aggregation

LUND UNIVERSITY Open Access Journals dissemination and integration in modern library services

Demystifying Scopus APIs

Open Access, RIMS and persistent identifiers

Getting Started: Primo Central Index. 1. Getting Started: Primo Central Index. 1.1 Primo Central Index. Notes:

CORE: Improving access and enabling re-use of open access content using aggregations

Full Issue: vol. 45, no. 4

NRF Open Access Statement Impact Survey

Pilot Study for the WHOIS Accuracy Reporting System: Preliminary Findings

Making scholarly statistics count in UK repositories. RSP Statistics Webinar Paul Needham, Cranfield University 26 February 2013

Web of Science. Platform Release Nina Chang Product Release Date: December 10, 2017 EXTERNAL RELEASE DOCUMENTATION

How Primo Works VE. 1.1 Welcome. Notes: Published by Articulate Storyline Welcome to how Primo works.

DRI: Dr Aileen O'Carroll, Policy Manager, Digital Repository of Ireland, Royal Irish Academy

Data Curation Profile Human Genomics

Fixed Broadband Analysis Report. 01 October December 2013 between 00:00:00 and 24:00:00 Bahrain. Published 16 January 2014.

Inside Discovery The Journey to Summon 2.0

Online the Library

Citation Services for Institutional Repositories: Citebase Search. Tim Brody Intelligence, Agents, Multimedia Group University of Southampton

Monitoring and Reporting Drafting Team Monitoring Indicators Justification Document

A Framework for adaptive focused web crawling and information retrieval using genetic algorithms

Jisc Research Data Shared Service. Looking at the past, looking to the future

COAR Interoperability Roadmap. Uppsala, May 21, 2012 COAR General Assembly

Implementing Web-Scale Discovery in an Academic Research Library

ODPi and Data Governance Free Your MetaData! October 10, 2018

Digital Libraries at Virginia Tech

In the absence of a context slide, CSIRO has a data repository (It's

Find your way around Library

Scientific databases

Prof. Ahmet Süerdem Istanbul Bilgi University London School of Economics

THE FRIENDLY GUIDE TO RELEASE 5

IDC MarketScape: Worldwide Datacenter Transformation Consulting and Implementation Services 2016 Vendor Assessment

Joint Industry Committee for Web Standards JICWEBS. Reporting Standards. AV / Ad Web Traffic

How to use EBSCOhost Research Databases

CrossRef tools for small publishers

The quality of your literature review is dependent on you knowing how to identify the. to use them effectively SESSION OVERVIEW THOUGHT OF THE DAY

English version. Professional Plagiarism Prevention. ithenticate Guide UNIST LIBRARY

Interim Report Technical Support for Integrated Library Systems Comparison of Open Source and Proprietary Software

The Curated Web: A Recommendation Challenge. Saaya, Zurina; Rafter, Rachael; Schaal, Markus; Smyth, Barry. RecSys 13, Hong Kong, China

Digital repositories as research infrastructure: a UK perspective

EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING

Web of Science. Platform Release 5.29 Release 2. Nina Chang Product Release Date: June 3, 2018 EXTERNAL RELEASE DOCUMENTATION

International Journal of Data Mining & Knowledge Management Process (IJDKP), Vol. 7, No. 3, May. Dr. Zakea Il-Agure and Mr. Hicham Noureddine Itani

Selected Members of the CCL-EAR Committee Review of EBSCO S MASTERFILE PREMIER September, 2002

InfoSci-Databases Platform

COUNTER Code of Practice 3.0 Technical Specifications for COUNTER Reports

REVIEW OF THE RPS MAP OF EVIDENCE

Using Caltech s Institutional Repository to Track OA Publishing in Chemistry

The DART-Europe E-theses Portal

Building Institutional Repositories: Emerging Challenges

A Cloud-based Web Crawler Architecture

IAB Podcast Measurement Guidelines

AWERProcedia Information Technology & Computer Science

Data publication and discovery with Globus

SCOPUS. Scuola di Dottorato di Ricerca in Bioscienze e Biotecnologie. Polo bibliotecario di Scienze, Farmacologia e Scienze Farmaceutiche

Open Access Statistics: Interoperable Usage Statistics for Open Access Documents

Internet Engineering Task Force (IETF) Request for Comments: 8336 Category: Standards Track. March 2018

Usability Testing Essentials

Competitive Intelligence and Web Mining:

ISO INTERNATIONAL STANDARD. Geographic information Quality principles. Information géographique Principes qualité. First edition

EBSCO MegaFILE. Basic and Advanced Searching. Rodney A. Briggs Library

AN OVERVIEW OF SEARCHING AND DISCOVERING WEB BASED INFORMATION RESOURCES

Unique Object Identifiers

National College of Ireland BSc in Computing 2017/2018. Deividas Sevcenko X Multi-calendar.

Getting Acquainted with PsycINFO (EBSCOhost)

Collection Policy. Policy Number: PP1 April 2015

CrossRef developments and initiatives: an update on services for the scholarly publishing community from CrossRef

MT at NetApp This is how we do it

Survey of Research Data Management Practices at the University of Pretoria

Around the Web in Six Weeks: Documenting a Large-Scale Crawl

Searching the Deep Web

Europeana update: aspects of the data

Access from the University of Nottingham repository:

A recommendation engine by using association rules

CCL-EAR COMMITTEE REVIEW Elsevier ScienceDirect Health & Life Sciences Journals (College Edition) October 2011

What do we need to build a successful knowledge base?

Nikolay Lobachevsky Academic Library

SciVerse Scopus. Date: 21 Sept, Coen van der Krogt Product Sales Manager Presented by Andrea Kmety

Standard Development Timeline

Performing searches on Érudit

ABSTRACT I. INTRODUCTION II. METHODS AND MATERIAL

COMPLIANCE 25TH MAY Are you prepared for the NEW GDPR rules?

On Demand Web Services with Quality of Service

EBSCO Comprehensive Solutions For Your Library. LIBER 46 th Annual Conference Patras, Hellas, 6 th July 2017

Transcription:

Provided by the author(s) and University College Dublin Library in accordance with publisher policies. Please cite the published version when available.

Title: Developing COUNTER standards to measure the use of Open Access resources
Author(s): Greene, Joseph
Publication date: 2017-05-24
Conference details: 9th International Conference on Qualitative and Quantitative Methods in Libraries (QQML2017), Limerick, Ireland, 23-26 May 2017
Link to online version: http://www.isast.org/qqml2017.html
Item record/more information: http://hdl.handle.net/10197/8464
Downloaded: 2018-03-11T00:06:51Z

The UCD community has made this article openly available. Please share how this access benefits you. Your story matters! (@ucd_oa)

Some rights reserved. For more information, please see the item record link above.

Developing COUNTER standards to measure the use of Open Access resources

Joseph W. Greene
University College Dublin and COUNTER Robots Working Group

Abstract: There are currently no standards for measuring the use of open digital content, including cultural heritage materials, research data, institutional repositories and open access journals. Such standards would enable libraries and publishers that invest in open digital infrastructure to make evidence-based decisions and demonstrate the return on this investment. The most closely related standard, the COUNTER Code of Practice (CoP), was designed for subscription-access e-resources and ensures that publishers provide consistent, credible and comparable usage data. In the open environment, computer programs known as web robots constantly download open content and must be filtered out of usage statistics. The COUNTER Robots Working Group has recently been formed to address this problem and to recommend robot detection techniques that are accurate, applicable and feasible for any provider of open content. Once accepted, they will be incorporated into the COUNTER CoP 5. In this paper we describe the overall goals of the analysis, the scope and techniques for building the dataset, and the robot detection techniques under investigation.

Keywords: Open Access, institutional repositories, OA journals, usage statistics, downloads, web robots, data mining

1. Introduction

Project COUNTER has brought together publishers, vendors and librarians to develop and maintain the COUNTER Code of Practice (CoP) since 2003. Adoption of the CoP ensures that publishers and vendors can provide consistent, credible and comparable usage data for e-resources, including peer-reviewed journal articles and e-books (COUNTER, 2017a). However, the COUNTER CoP was designed for resources that are accessed only from behind subscription barriers. In the open environment, publishers and libraries offer free, unrestricted access not only to their designated communities but also to computer programs, known as web robots, designed to crawl the web automatically for content. Web robots account for a very large percentage (between 40% and 85%) of open content usage and must be filtered out of the statistics for them to be meaningful (Greene, 2016; Huntington et al., 2008).

This paper describes the methodology used to test the effectiveness of a set of filters to detect and remove robot activity from open access usage statistics, balancing the highest accuracy of usage statistics against the lowest barrier to implementation. The results of this research will inform and, pending community approval, become part of the COUNTER Code of Practice 5. It should be noted that the research itself is currently in progress and is subject to variation from what is documented here.

2. Raw data to structured data

In September 2016, Project COUNTER formed a Robots Working Group, composed of volunteer experts from large publishers and vendors (EBSCO, Elsevier, Wiley) and representatives of smaller publishers, open access journal hosts, institutional repositories (IRUS-UK, DSpace, EPrints, Digital Commons) and aggregators (OpenAIRE). Three members offered the use of a subset of their data: Bielefeld University, which hosts a number of open access journals; IRUS-UK, which aggregates usage data from 127 open access institutional repositories (IRUS-UK, 2017); and Wiley, a large academic publisher that hosts a wide range of content, both open and subscription-based.

Each dataset included raw usage event data for the one-week period of 3 to 9 October 2016. The Bielefeld dataset consisted of Apache log files (anonymised by masking the final two octets of the IP addresses) from 7 open access journals, each hosted on PKP's Open Journal Systems (OJS); the three largest log files, representing three journals, were selected. The IRUS-UK dataset contained item downloads from 97 UK open access institutional repositories. The Wiley dataset included all usage events on the platform, with registered crawlers (including Googlebot and some technical partners) removed. Characteristics of the datasets are described in Table 1.

Source                  Original format     Lines, raw data   Lines, downloads only (N)   Sample size (n)   Confidence
Bielefeld OJS journals  Apache server logs  232,944           14,536                      202               95%
IRUS-UK                 TSV                 1,935,689         1,935,683                   204               95%
Wiley                   TSV                 30,419,834        5,098,763                   204               95% [1]

[1] Some registered crawlers were removed from Wiley's raw data and the robots-to-total ratio is unknown, so the confidence level may vary in the final analysis.

Table 1. Data sources, sizes and samples

The datasets were converted into a usable format and imported into a PostgreSQL database. Each dataset presented its own restructuring challenges. The OJS dataset included all HTTP requests to each site, so requests for full-text downloads had to be accurately identified and extracted using regular expressions; the Apache combined-format log files were then converted to SQL statements, again using regular expressions, in order to import them into PostgreSQL. The Wiley dataset was quite large at 6.3 GB and also included many events that were not downloads, which were removed. The final database contained 7,048,991 rows, each row describing a unique download event: data source, IP address, user agent, referring URL, request URL, unique item identifier, date, time and session ID (for Wiley items).
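To make the extraction step concrete, the sketch below shows how full-text download events might be pulled from an Apache combined-format log in Python. The paper does not give the actual expressions used, so the download URL pattern (an OJS-style /article/download/ path) and the success-status check are illustrative assumptions only.

    import re

    # Apache "combined" log format:
    # host ident authuser [time] "request" status bytes "referer" "user-agent"
    LOG_RE = re.compile(
        r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
        r'"(?P<method>\S+) (?P<url>\S+)[^"]*" '
        r'(?P<status>\d{3}) \S+ '
        r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"')

    # Hypothetical OJS galley pattern; real download URLs vary by installation.
    DOWNLOAD_RE = re.compile(r'/article/download/\d+/\d+')

    def download_events(path):
        """Yield successful full-text download events from an Apache log file."""
        with open(path, encoding='utf-8', errors='replace') as log:
            for line in log:
                m = LOG_RE.match(line)
                if m and m.group('status') == '200' \
                     and DOWNLOAD_RE.search(m.group('url')):
                    yield m.groupdict()

    for event in download_events('access.log'):
        print(event['ip'], event['time'], event['url'], event['agent'])

Each matched event could then be emitted as an SQL INSERT statement for loading into PostgreSQL, as described above.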

3. Robot detection and filtration experiments

With the data from each of the three sources in a single database, the downloads could be queried and sampled in a systematic manner. We devised a simple random sample calculator based on Scheaffer et al. (2006), assuming a robots-to-total ratio (p value) of .85, based on Greene (2016) and Information Power Ltd. (2013), for a confidence level of 95%. It should be noted that the confidence level for the Wiley sample may vary from this in the final analysis for two reasons: the non-inclusion of registered crawlers mentioned above, and the fact that its robots-to-total ratio likely differs from that of a fully open access dataset. In all, the sample totalled 610 downloads; the sizes of the individual samples are given in Table 1.
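The paper does not reproduce the formula, but the finite-population sample size estimator for a proportion in Scheaffer et al. (2006), with an assumed error bound of B = 0.05, yields exactly the sample sizes reported in Table 1. A minimal sketch:

    import math

    def sample_size(N, p=0.85, B=0.05):
        """Finite-population sample size for estimating a proportion p
        to within bound B (Scheaffer et al., 2006)."""
        q = 1 - p
        D = B ** 2 / 4          # the 4 is z^2 with z ~= 2 at 95% confidence
        return math.ceil(N * p * q / ((N - 1) * D + p * q))

    for source, N in [('Bielefeld', 14536), ('IRUS-UK', 1935683), ('Wiley', 5098763)]:
        print(source, sample_size(N))   # 202, 204, 204 -- as in Table 1

Note how n levels off near pq/D = 204 as N grows, which is why the two much larger datasets share the same sample size.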

Additional variables, described in Table 2, were added to the sampled data. Wiley items were labelled open or closed access; 45% of the items in the download sample were found to be open access. Three passes were then made through the dataset in order to determine whether the downloads were made by humans or robots, following closely the methodology described in Greene (2016).

Field: Use
download_id: Unique identifier for the download event
session_id: Web site session ID (Wiley only)
item_id: Unique identifier of the item downloaded
IP: IP address that downloaded the item
origin: Organisation attributed to the IP address
indicator: Indicator or sum of indicators used to determine robots [1]
other_indicators: Description of any other indicators of robot/human behaviour
counter_list: Boolean; true where the agent name matches the COUNTER list of robots, crawlers and spiders
robot: Boolean representing whether an item was manually determined to be a robot (1 = true = robot)
date: Date of the download
time: Time of the download
source: Source of the download data (IRUS-UK, Bielefeld, Wiley)
agent: User agent string provided for the download event
agent_notes: Count of agents used by the IP over the course of the sample period; other info as needed
dl_peak_this_item: Total downloads of this item by this IP address during the period
dl_peak_any_item: Highest total downloads of any single item by this IP address during the period
dl_site: Total downloads by this IP address during the period
dl_site_ip_agent: Total downloads by this IP/agent pair during the period
dl_site_session_id: Total downloads with this session ID (Wiley only)
dl_per_day_peak: Peak downloads by this IP address on a single day during the period
total_items_downloaded: Number of items downloaded by this IP during the period
total_items_downloaded_ip_agent: Number of items downloaded by this IP/agent pair during the period
total_items_downloaded_session_id: Number of items downloaded with this session ID (Wiley only)
first_seen: Date of the first download by this IP address during the period
last_seen: Date of the last download by this IP address during the period
flagged: Boolean representing whether the source data provider labelled this as a robot (1 = true = flagged as robot)
oa_status: Status of the article downloaded: Closed, OA (Hybrid) (the item is OA in a hybrid journal) or OA (Full) (the item is in a fully OA journal) (Wiley only)

[1] Indicators: 1 = agent name; 2 = reverse lookup; 4 = access/frequency; 8 = other indicator(s)

Table 2. Variables used for determining robots

The first pass through the data involved identifying and labelling self-identifying robots based on the user agent field. The second pass required a reverse DNS lookup for the remaining downloads, identifying unambiguously human or non-human downloads based on the frequency of downloads, the number of items downloaded, and the source of the download, whether an internet service provider, a server hosting company, a university, etc.
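No code for these passes is given in the paper, but a minimal sketch of the first two, assuming a handful of placeholder patterns rather than the full COUNTER list of robots, crawlers and spiders, might look like this:

    import re
    import socket

    # Illustrative patterns only; the actual COUNTER list of robots,
    # crawlers and spiders is much longer.
    ROBOT_UA_RE = re.compile(r'bot|crawl|spider|curl|wget', re.IGNORECASE)

    def self_identifying_robot(user_agent):
        """Pass 1: label downloads whose user agent declares itself a robot."""
        return bool(ROBOT_UA_RE.search(user_agent or ''))

    def origin_hostname(ip):
        """Pass 2: attribute an organisation to an IP via reverse DNS."""
        try:
            hostname, _, _ = socket.gethostbyaddr(ip)
            return hostname       # e.g. a hosting company vs. an ISP or university
        except (socket.herror, socket.gaierror):
            return None           # no PTR record; leave for the manual pass

Downloads that remain ambiguous after these checks fall through to the third, manual pass described next.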

The third pass on the remaining downloads involved querying the database for more information on the session and checking Project Honeypot (Unspam Technologies Inc., 2017) for further information on the IP addresses. Once each item in the dataset is fully labelled as either robot or human, the data will be shared with members of the COUNTER Robots Working Group for peer review to ensure the best possible characterisation of the downloads. At that stage the filters listed in Table 3 will be simulated on the sampled data. A new column will be added for each filter to indicate whether the filter would have labelled the download as robot or human. Each filter and combination of filters can then be categorised as true/false positive or true/false negative, and recall and precision calculated.

Filter: Tests
UA string: Check effectiveness of the COUNTER CoP list of internet robots, crawlers and spiders [1] against manual checking
Rate of requests (single item by a single user [2]): Ascertain best-fit threshold, e.g. within a 24-hour period
Volume of requests (site-wide by a single user [2]): Ascertain best-fit threshold, e.g. within a 24-hour period
Double-clicks (COUNTER CoP 5, section 7.2) [3]: Determine effectiveness
User agents per IP address: Ascertain best-fit threshold
Request = referrer: Determine effectiveness

[1] https://www.projectcounter.org/code-of-practice/appendices/850-2/
[2] A user is defined by the best available data: IP address, IP/user agent pair, or session ID. This is a proposed adaptation of current COUNTER recommendations (COUNTER, 2017b).
[3] https://www.projectcounter.org/code-of-practice/counter-release-5-draft-code-practice-consultation/

Table 3. Filters to be tested individually and in combination

While (robot) recall and precision are the basic measures of the filters' effectiveness, of greater concern is how accurate the resulting filtered download/usage statistics will be. In this context, the filtered usage statistics are measured using inverse recall and inverse precision. A low inverse recall indicates that many human downloads were excluded from the usage statistics; a low inverse precision indicates that many robots were included in the usage statistics. We will report on both measures, but will favour inverse precision as the measure of 'accuracy' for two reasons: we assume we will achieve a high inverse recall by default (as the proportion of robots to humans is very high), and because, while the other three measures (recall, precision and inverse recall) are in a way statements about what is excluded, inverse precision measures only what is reported as genuine usage. That having been said, inverse recall will be an indicator of overreach for any particular filter, for example in the case of institutions that access e-resources through a proxy, thereby assigning a single IP address to every member of the institution. Once the best combination of filters has been determined, the recommendations will be shared with the COUNTER community and feedback sought.
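In confusion-matrix terms, with 'robot' as the positive class, the four measures can be computed as follows (a sketch with invented counts; the helper name is mine, not the working group's):

    def filter_metrics(tp, fp, fn, tn):
        """Score a robot filter, with 'robot' as the positive class.

        tp: robots correctly removed      fp: humans wrongly removed
        fn: robots wrongly kept           tn: humans correctly kept
        """
        return {
            'recall': tp / (tp + fn),            # share of robots caught
            'precision': tp / (tp + fp),         # share of removed events that were robots
            'inverse_recall': tn / (tn + fp),    # share of human usage retained
            'inverse_precision': tn / (tn + fn), # share of reported usage that is human
        }

    # Invented counts for a 610-download sample that is ~85% robots.
    print(filter_metrics(tp=500, fp=5, fn=20, tn=85))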

4. Conclusion

Registered vendors undergo an independent audit to be considered COUNTER compliant, and for this reason the Code of Practice is generally oriented towards commercial operations. With representatives from PKP, DSpace, EPrints and Digital Commons on the working group, it is hoped that the recommendations will also see wide adoption within the Open Access community, and that 'COUNTER conformant' usage statistics will become the norm. As a final note, the filters developed and studied here are not intended to be exhaustive; they are simply the beginning of a set of standards, to be built upon, that will help in achieving consistent, credible and comparable usage statistics that can be aggregated across many types of scholarly communication platforms.

References

COUNTER (2017a). About COUNTER. Available at: https://www.projectcounter.org/about/ (accessed 28 April 2017).

COUNTER (2017b). COUNTER Release 5 Draft Code of Practice, section 'Data Processing'. Available at: https://www.projectcounter.org/code-of-practice-sections/data-processing/ (accessed 28 April 2017).

Greene, J. W. (2016). Web robot detection in scholarly Open Access institutional repositories. Library Hi Tech, Vol. 34, No. 3, 500-520.

Huntington, P., Nicholas, D. & Jamali, H. R. (2008). Web robot detection in the scholarly information environment. Journal of Information Science, Vol. 34, No. 5, 726-741.

Information Power Ltd. (2013). IRUS download data: identifying unusual usage. Available at: http://www.irus.mimas.ac.uk/news/irus_download_data_final_report.pdf (accessed 11 December 2015).

IRUS-UK (2017). IRUS-UK. Available at: http://www.irus.mimas.ac.uk/ (accessed 28 April 2017).

Scheaffer, R. L., Mendenhall, W. & Ott, R. L. (2006). Elementary Survey Sampling. Thomson, London.

Unspam Technologies Inc. (2017). Project Honeypot. Available at: https://www.projecthoneypot.org (accessed 28 April 2017).