Using Internet as a Data Source for Official Statistics: a Comparative Analysis of Web Scraping Technologies

Giulio Barcaroli (barcarol@istat.it), Monica Scannapieco (scannapi@istat.it), Donato Summa (donato.summa@istat.it) – Istat, Istituto Nazionale di Statistica
Marco Scarnò (m.scarno@cineca.it) – Cineca

Keywords: Big data, Web Scraping, Web data

1. INTRODUCTION

The Internet can be considered a data source (belonging to the vast category of Big Data) that may be harnessed as a substitute for, or in combination with, data collected by means of the traditional instruments of a statistical survey. The survey on ICT in enterprises, carried out by all EU Statistical Institutes, is a natural candidate for experimenting with this approach: the questionnaire contains a number of questions on the characteristics of the websites owned or used by the enterprises, whose answers can be deduced directly from the content of those websites (for instance, the presence of web sales functionality).

An experiment is being conducted whose aim is twofold: (i) from a technological point of view, to verify the capability to access the websites indicated by the enterprises participating in the sampling survey and to collect all the relevant information; (ii) from a methodological point of view, to use the information collected from the Internet to predict the characteristics of the websites not only of the surveyed enterprises, but of the whole population of reference, in order to produce estimates with a higher level of accuracy.

The first phase of the experiment was based on the survey data, that is, a sample of 19,000 responding enterprises that indicated a total of 8,600 websites. The websites were scraped and the collected texts were used as training and test sets in order to verify the validity of the applied machine learning techniques [1]. In the second phase, the whole reference population (192,000 enterprises) is involved, together with the approximately 90,000 websites owned or used by these enterprises. The web scraping task, which was already crucial in the first phase, becomes critical in terms of both efficiency and effectiveness, given the increased number of websites. For this reason, a number of different solutions are being investigated, based on (i) the Apache suite Nutch/Solr, (ii) the tool HTTrack, and (iii) the development of new web scraping functionalities in the ADaMSoft package making use of JSOUP. In this paper, these alternative solutions are evaluated by comparing the results obtained in terms of both efficiency and compliance with Official Statistics (OS) requirements.

2. WEB SCRAPING SYSTEMS

In this section we first introduce the concept of web scraping and position our work with respect to previous experiences (Section 2.1). In the subsequent sections we provide an overview of the scraping systems under evaluation (Sections 2.2, 2.3 and 2.4).

2.1. Web Scraping: State of the Art in Official Statistics

Web scraping is the process of automatically collecting information from the World Wide Web by means of tools (called scrapers, internet robots, crawlers, spiders, etc.) that navigate websites, extract their content and store the scraped data in local databases for subsequent processing.

Previous work on the use of the Internet as a data source for Official Statistics by means of web scraping was carried out by Statistics Netherlands in recent years [2]. In particular, a first domain of experimentation was related to air tickets: the prices of air tickets were collected daily by Internet robots, developed by Statistics Netherlands with the support of two external companies, and the results were stored for several months. The experiment showed a common trend between the ticket prices collected by the robots and those obtained by the existing manual collection [3]. Two additional domains of experimentation were the Dutch property market and clothes prices; the former exhibited more regularity in the structure of the sites, while the latter was more challenging with respect to automatic classification, due to the lack of a standard naming of the items and the variability in the organization of the sites. Similarly, in Italy a scraping activity was performed to collect data on consumer electronics prices and airfares [4].

2.2. The Apache Stack: Nutch/Solr/Lucene

The Apache suite used for crawling, content extraction, indexing and searching the results is composed of Nutch and Solr. Nutch (available at https://nutch.apache.org/) is a highly extensible and scalable open source web crawler: it provides facilities for parsing, indexing and building a search engine, can be customized according to specific needs, and offers scalability, robustness and scoring filters for custom implementations. Built on top of Apache Lucene and based on Apache Hadoop, Nutch can be deployed on a single machine as well as on a cluster when large-scale web crawling is required. Apache Solr (available at https://lucene.apache.org/solr/) is an open source enterprise search platform, also built on top of Apache Lucene. It can be used for searching any type of data; in this context, however, it is specifically used to search web pages. Its major features include full-text search, hit highlighting, faceted search, dynamic clustering, database integration and rich document handling. Providing distributed search and index replication, Solr is highly scalable. Both Nutch and Solr have an extensive plugin architecture, useful when advanced customization is required.

Although this web scraping approach requires an initial effort in terms of technological expertise, in the long run it can lead to a substantial return on investment, as it can be used in many other contexts to access Big Data sources. For example, it can be used as a platform to access and analyse web resources such as blogs or social media, and to perform semantic extraction and analysis tasks.

2.3. HTTrack

HTTrack (available at http://www.httrack.com/) is a free and open source software tool that allows a website to be mirrored locally by downloading each page that composes its structure. In technical terms it is a web crawler and an offline browser that can be run on several operating systems (Microsoft Windows, Mac OS X, Linux, FreeBSD and Android). HTTrack's strong points are its ease of use and the fine-grained configurability of its parameters. It can be run via a graphical user interface or in batch mode via the command line, as in the sketch below.
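The following minimal Java sketch (not part of the systems described in this paper) illustrates how HTTrack could be driven in batch mode from a scraping pipeline. It assumes the httrack binary is installed and available on the PATH; the input file websites.txt and the output directory are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class HttrackBatch {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Hypothetical input: one enterprise website URL per line
        List<String> urls = Files.readAllLines(Paths.get("websites.txt"));
        Path outRoot = Paths.get("mirrors");
        Files.createDirectories(outRoot);

        int i = 0;
        for (String url : urls) {
            Path outDir = outRoot.resolve("site_" + (i++));
            // Basic HTTrack invocation: mirror the site into a dedicated directory
            ProcessBuilder pb = new ProcessBuilder("httrack", url, "-O", outDir.toString());
            pb.inheritIO();                 // show HTTrack console output
            Process p = pb.start();
            int exitCode = p.waitFor();
            if (exitCode != 0) {
                System.err.println("Mirroring failed for " + url);
            }
        }
    }
}
```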

2.4. JSOUP

JSOUP (http://jsoup.org) is a library that permits parsing an HTML document and extracting its structure. It has been integrated into a specific step of the ADaMSoft system (http://adamsoft.sourceforge.net), the latter selected because it already includes facilities for handling huge data sets and textual information. In this approach we collected the content of the HTML pages together with the information on their structure (tags), because the latter can contain discriminant terms that help in identifying the nature of the website; for example, a button associated with an image called "paypal.jpg" could be a clear sign of web sales functionality.
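As an illustration of this kind of inspection of page content and structure, the following minimal Java sketch uses JSOUP directly (it is not the ADaMSoft implementation, whose code is not reported here); the URL and the indicator terms are illustrative assumptions.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class SalesIndicatorCheck {
    public static void main(String[] args) throws Exception {
        String url = "http://www.example-enterprise.it";   // hypothetical enterprise website
        Document doc = Jsoup.connect(url)
                            .userAgent("Mozilla/5.0 (compatible; scraping-experiment)")
                            .timeout(10_000)
                            .get();

        // Structure (tags): images or links whose attributes mention a payment service
        Elements payElems = doc.select("img[src*=paypal], a[href*=paypal]");

        // Content: free text of the page, searched for illustrative discriminant terms
        String text = doc.body().text().toLowerCase();
        boolean hasCartTerm = text.contains("carrello") || text.contains("shopping cart");

        System.out.println("payment-related elements: " + payElems.size());
        System.out.println("cart-related terms found: " + hasCartTerm);
    }
}
```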

3. A COMPARATIVE ANALYSIS AND RECOMMENDATIONS

The three systems described in the previous sections have been compared with respect to efficiency features (summarized in Table 1). In addition, it is possible to verify how appropriate they are in addressing requirements that are specific to their usage in Official Statistics production processes (summarized in Table 2).

Table 1: Efficiency features of the three systems

Tool    | Websites reached      | Avg. webpages per site | Time spent | Type of storage                          | Storage size
Nutch   | 7,020 / 8,550 (82.1%) | 15.2                   | 32.5 hours | binary files on HDFS                     | 2.3 GB (data), 5.6 GB (index)
HTTrack | 7,710 / 8,550 (90.2%) | 43.5                   | 6.7 days   | HTML files on the file system            | 16.1 GB
JSOUP   | 7,835 / 8,550 (91.6%) | 68                     | 11 hours   | HTML in ADaMSoft compressed binary files | 500 MB

Looking at Table 1:
- HTTrack and JSOUP are comparable with respect to website reachability, while the number of websites reached by Nutch is considerably lower.
- JSOUP outperforms both of the other systems on the actual download of pages at the reached sites.
- Time performance is again in favour of JSOUP. It is important to notice, however, that we did not spend much time optimizing the performance of HTTrack, as it was the system we experimented with last; there is therefore a margin for improving its performance in the next steps.
- Space performance: given that JSOUP was integrated with the ADaMSoft system, it was possible to compress the files resulting from the scraping and hence to save disk space.

Table 2: Effectiveness of the systems with respect to requirements

Tool    | Access to specific elements of HTML pages | Download of site content as a whole for semantic extraction and discovery | Document querying | Scalability to Big Data size
Nutch   | Difficult | Easy | Easy      | Easy
HTTrack | Easy      | Easy | Difficult | Difficult
JSOUP   | Easy      | Easy | Difficult | Difficult

As shown in Table 2:
- The column on access to specific elements of HTML pages shows that HTTrack and JSOUP are more appropriate than Nutch. To understand the implication of this feature, it is important to notice that the scraping task can be design-based, i.e. it is possible to design in advance the access (i) to elements of specific website structures (e.g. the table of a specific site containing the price list of some products) or (ii) to general website elements (e.g. the label of a shopping cart image, if present). Nutch does not (easily) allow this kind of design-time access, being instead more appropriate for exploratory runtime analysis (as also remarked by the following features).
- The column on downloading site content shows that all three systems perform well, i.e. they address the requirement by permitting an easy implementation.
- The column on document querying shows that Nutch is the best, which is explained by its native integration with the Solr/Lucene platform (a minimal query sketch is given after this list).
- Finally, with respect to scalability, the native integration of Nutch with the MapReduce/Hadoop infrastructure again makes Nutch the best choice. We observe, however, that the integration of JSOUP within ADaMSoft also permits storing the results on a secondary structured or semi-structured storage in order to scale up (though we have not yet tested this functionality).
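As an illustration of the document-querying capability, the following minimal Java sketch (an assumption-laden example, not code from the project) uses SolrJ to query a Solr core populated by a Nutch crawl; the core name, the local Solr URL and the field names (url, title, content, as in a typical Nutch schema) are assumptions.

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class QueryScrapedPages {
    public static void main(String[] args) throws Exception {
        // Hypothetical local Solr core filled by a Nutch crawl
        try (HttpSolrClient solr =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/nutch").build()) {

            // Full-text query for pages mentioning web-sales indicators
            SolrQuery query = new SolrQuery("content:paypal OR content:\"shopping cart\"");
            query.setFields("url", "title");   // assumed field names
            query.setRows(20);

            QueryResponse response = solr.query(query);
            for (SolrDocument doc : response.getResults()) {
                System.out.println(doc.getFieldValue("url") + "  " + doc.getFieldValue("title"));
            }
        }
    }
}
```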

4. CONCLUSIONS

A first remark is that a scraping task can be carried out for different purposes in OS production, and the choice of a single tool for all purposes may not always be possible. For the specific scraping task required by the ICT Usage in Enterprises survey, the use of JSOUP/ADaMSoft appears to be the most appropriate. In the second step of the project, when we will have to scale up to about 90,000 websites, we will test how this system performs with respect to scalability.

Finally, we highlight that the scraping task evaluated in this paper with three different systems is a sort of generalized scraping task: it assumes a data collection without any specific assumption on the structure of the websites. In this sense it goes a step further with respect to all the previous experiences.

REFERENCES

[1] G. Barcaroli, A. Nurra, S. Salamone, M. Scannapieco, M. Scarnò, D. Summa, Internet as Data Source in the Istat Survey on ICT. Accepted for publication in Austrian Journal of Statistics (2014).

[2] O. ten Bosch, D. Windmeijer, On the Use of Internet Robots for Official Statistics. MSIS-2014. URL: http://www.unece.org/fileadmin/DAM/stats/documents/ece/ces/ge.50/2014/Topic_3_nl.pdf (2014).

[3] R. Hoekstra, O. ten Bosch, F. Harteveld, Automated data collection from web sources for official statistics: first experiences. Statistical Journal of the IAOS: Journal of the International Association for Official Statistics, 28(3-4), 2012.

[4] R. Giannini, R. Lo Conte, S. Mosca, F. Polidoro, F. Rossetti, Web scraping techniques to collect data on consumer electronics and airfares for Italian HICP compilation. Q2014 European Conference on Quality in Official Statistics, Vienna, 2-5 June 2014.