Using Internet as a Data Source for Official Statistics: a Comparative Analysis of Web Scraping Technologies
1 NTTS 2015 Session 6A - Big data sources: web scraping and smart meters
Giulio Barcaroli (*) (barcarol@istat.it), Monica Scannapieco (*) (scannapi@istat.it), Marco Scarnò (**) (m.scarnò@cineca.it), Donato Summa (*) (donato.summa@istat.it)
(*) Istituto Nazionale di Statistica (Istat)
(**) Consorzio Interuniversitario per il Calcolo Automatico (CINECA)
2 Web scraping definition and types
Web scraping is the process of automatically collecting information from the World Wide Web, using tools (called scrapers, internet robots, crawlers, spiders, etc.) that navigate websites, extract their content and store the scraped data in local databases for subsequent processing.
We can distinguish two kinds of web scraping:
1. specific web scraping, when both the structure and the content of the websites to be scraped are perfectly known, and crawlers just have to replicate the behaviour of a human being visiting the website and collecting the information of interest. Typical area of application: data collection for consumer price indices (ONS, CBS, Istat);
2. generic web scraping, when no a priori knowledge of the content is available, and the whole website is scraped and subsequently processed in order to infer the information of interest.
3 An application on «ICT in enterprises» survey
4 Web scraping: different techniques and tools
Different web scraping solutions are being investigated, based on the use of:
(i) the Apache suite Nutch/Solr, for crawling, content extraction, indexing and searching results. Nutch is a highly extensible and scalable open source web crawler; it facilitates parsing, indexing, creating a search engine, customizing search according to needs, scalability, robustness, and scoring filters for custom implementations;
(ii) HTTrack, a free and open source software tool that mirrors a web site locally by downloading each page that composes its structure. In technical terms it is a web crawler and an offline browser;
(iii) JSOUP, which parses an HTML document and extracts its structure. It has been integrated in a specific step of the ADaMSoft system, which was selected because it already includes facilities for handling huge data sets and textual information.
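To make the content extraction step more concrete, here is a minimal sketch of parsing one HTML page into visible text and outgoing links. It uses Python's standard `html.parser` module as a stand-in; it is not JSOUP (a Java library) and not the ADaMSoft step described in the slides. The names `PageExtractor` and `extract` and the sample page are hypothetical.

```python
from html.parser import HTMLParser


class PageExtractor(HTMLParser):
    """Collect visible text and hyperlink targets from one HTML document."""

    SKIPPED = {"script", "style"}  # content of these tags is not visible text

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside <script> and <style>.
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def extract(html):
    """Return (visible_text, links) for a single HTML string."""
    parser = PageExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_parts), parser.links


# Hypothetical sample page, used only to exercise the sketch.
sample = (
    "<html><head><title>Shop</title><style>p {color: red}</style></head>"
    "<body><p>Order online</p><a href='/cart'>Cart</a></body></html>"
)
text, links = extract(sample)
```

In a generic scraping setting the extracted text would feed the later text mining steps, while the links would tell a crawler which pages of the same site to visit next.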
5 Web scraping solutions evaluation
These techniques are evaluated by taking into account:
1. efficiency: the number of websites actually scraped out of the total, and execution performance;
2. effectiveness: the completeness and richness of the collected text, which can influence the quality of prediction.
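6 Web scraping techniques evaluation: efficiency
Nutch
  # websites reached: 7020 / 8550 = 82.1%
  Average number of webpages per site: 15.2
  Time spent: 32.5 hours
  Type of storage: binary files on HDFS
  Storage dimensions: 2.3 GB (data), 5.6 GB (index)
HTTrack
  # websites reached: 7710 / 8550 = 90.2%
  Average number of webpages per site: 43.5
  Time spent: 6.7 days
  Type of storage: HTML files on file system
  Storage dimensions: 16.1 GB
JSOUP
  # websites reached: 7835 / 8550 = 91.6%
  Average number of webpages per site: [missing]
  Time spent: [missing] hours
  Type of storage: HTML ADaMSoft compressed binary files
  Storage dimensions: 500 MB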
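The reach percentages in the efficiency comparison can be recomputed from the counts on the slide. The sketch below does that, and also derives an approximate number of pages collected (sites reached times average pages per site) for the two tools whose averages are reported. That second figure is a derived estimate, not a value given in the slides, and the variable names are illustrative.

```python
TOTAL_SITES = 8550  # websites targeted, from the slide

# Counts of websites reached, from the slide.
reached = {"Nutch": 7020, "HTTrack": 7710, "JSOUP": 7835}

# Average webpages per site, reported only for these two tools.
avg_pages = {"Nutch": 15.2, "HTTrack": 43.5}


def reach_rate(n_reached, total=TOTAL_SITES):
    """Share of targeted websites actually reached, in percent."""
    return 100.0 * n_reached / total


rates = {tool: round(reach_rate(n), 1) for tool, n in reached.items()}

# Derived estimate (assumption: the average applies to every reached site).
approx_pages = {tool: round(reached[tool] * avg_pages[tool]) for tool in avg_pages}
```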
7 Web scraping techniques evaluation: effectiveness
The effectiveness of the different solutions is evaluated by applying text and data mining steps to the collected data in order to predict a subset of the target information of the survey.
The developed application is available on the Adamsoft website: appscripts.html
8 Prediction of survey information by text and data mining
Application of Naïve Bayes to predict all questions in section B8.
Question B8: "indicate if the Website have any of the following facilities"
a) Online ordering or reservation or booking (web sales functionality)
b) Tracking or status of orders placed
c) Description of goods or services, price lists
d) Personalized content in the website for regular/repeated visitors
e) Possibility for visitors to customize or design online goods or services
f) A privacy policy statement, a privacy seal or a website safety certificate
g) Advertisement of open job positions or online job application
Performance of Naive Bayes, reported for each item: precision, sensitivity, specificity, observed proportion, predicted proportion.
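As an illustration of the prediction step, the following sketch trains a tiny multinomial Naive Bayes classifier on page text for one yes/no item (for example, item a) and computes the five indicators named on the slide. Several things here are assumptions: the slides do not give their formulas, so the standard confusion-matrix definitions are used; the training texts are invented; and the model is a toy, not the ADaMSoft application.

```python
import math
from collections import Counter


def train_nb(docs, labels):
    """Fit a two-class multinomial Naive Bayes model (labels are 0 or 1)."""
    vocab = {w for d in docs for w in d.split()}
    word_counts = {0: Counter(), 1: Counter()}
    class_counts = Counter(labels)
    for doc, y in zip(docs, labels):
        word_counts[y].update(doc.split())
    return vocab, word_counts, class_counts


def predict_nb(model, doc):
    """Return the most probable class, using Laplace (add-one) smoothing."""
    vocab, word_counts, class_counts = model
    n_docs = sum(class_counts.values())
    scores = {}
    for y in (0, 1):
        total = sum(word_counts[y].values())
        score = math.log(class_counts[y] / n_docs)  # assumes both classes occur
        for w in doc.split():
            if w in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[y][w] + 1) / (total + len(vocab)))
        scores[y] = score
    return max(scores, key=scores.get)


def evaluate(y_true, y_pred):
    """Standard indicators from the 2x2 confusion matrix (non-empty cells assumed)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    n = len(pairs)
    return {
        "precision": tp / (tp + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "observed_proportion": (tp + fn) / n,
        "predicted_proportion": (tp + fp) / n,
    }


# Invented toy texts: 1 = site offers online ordering, 0 = it does not.
train_docs = [
    "add to cart checkout order online",
    "book now reservation payment",
    "company history about us",
    "contact us our team mission",
]
train_labels = [1, 1, 0, 0]
model = train_nb(train_docs, train_labels)

test_docs = ["order online checkout", "about our company",
             "reservation and payment", "our mission"]
preds = [predict_nb(model, d) for d in test_docs]

# Indicators on a separate, hypothetical pair of label vectors.
metrics = evaluate([1, 1, 0, 0, 0], [1, 0, 1, 0, 0])
```

Comparing the observed and predicted proportions is what matters for estimation purposes: a classifier can misclassify individual sites and still reproduce the population share if its false positives and false negatives balance out.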
9 Web scraping: from sample to whole population
So far, the three web scraping solutions have been applied to a limited number of websites: those of the enterprises that responded to the sample survey and declared that they have a website (8,600). The next step is to scrape all the websites owned by the enterprises in the population of interest (212,000).
Two problems:
1. URLs retrieval: how to identify all the websites owned by the 212,000 enterprises (between 90,000 and 100,000 are expected to own a website);
2. massive scraping: how to increase efficiency when scaling up by a factor of 10, from O(10^4) to O(10^5) websites.
10 Web scraping: URLs retrieval
General idea: for each enterprise:
1. Querying search engines with the enterprise denomination
2. Processing the first ten URLs retrieved in order to choose the right one for the given enterprise
Processing:
a) matching of the enterprise information (denomination, fiscal code, etc., available from administrative data) against the content of the first ten URLs retrieved;
b) use of the subset of enterprises (from survey data) for which the correct URL is known as a training set, in order to maximise the precision of the choice function;
c) application of the choice function to the whole set.
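The matching step in a) and the choice function in c) could be sketched as below. This is a simplified illustration under explicit assumptions: the weights and the threshold are hand-set placeholders, whereas the slides describe tuning the choice function on a training set of enterprises with known URLs; the function names, the enterprise record and the candidate pages are all hypothetical.

```python
import re


def normalise(text):
    """Lowercase and reduce any run of non-alphanumerics to a single space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def score_candidate(enterprise, url, page_text):
    """Score how well one candidate page matches the enterprise record."""
    text_tokens = set(normalise(page_text).split())
    url_tokens = set(normalise(url).split())
    name_tokens = normalise(enterprise["name"]).split()
    score = 0.0
    # Placeholder weights; in practice these would be learned from training data.
    if enterprise["fiscal_code"] in page_text:
        score += 3.0
    if name_tokens:
        hits = sum(tok in text_tokens for tok in name_tokens)
        score += 2.0 * hits / len(name_tokens)
        if any(tok in url_tokens for tok in name_tokens):
            score += 1.0
    return score


def choose_url(enterprise, candidates, threshold=2.0):
    """Return the best-scoring URL, or None if no candidate reaches the threshold."""
    best_url, best_score = None, threshold
    for url, page_text in candidates:
        s = score_candidate(enterprise, url, page_text)
        if s >= best_score:
            best_url, best_score = url, s
    return best_url


# Invented enterprise record and search results.
enterprise = {"name": "Rossi Ceramiche", "fiscal_code": "01234567890"}
candidates = [
    ("https://directory.example/rossi",
     "Business directory: Rossi Ceramiche, Bianchi Marmi"),
    ("https://www.rossi-ceramiche.example/",
     "Rossi Ceramiche - P.IVA 01234567890 - tiles and sanitaryware"),
]
chosen = choose_url(enterprise, candidates)
no_match = choose_url(enterprise, [("https://news.example/", "Daily news")])
```

Returning `None` when nothing reaches the threshold reflects the fact that many enterprises have no website at all, so forcing a choice among the first ten results would inflate false matches.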
11 Web scraping: mass scraping
Use of Nutch on top of MapReduce / Hadoop to harness parallelism.
Completed tasks:
- enhancement of Nutch by using the following plugins:
  - HTML-Plugin (Nutch custom search) to retrieve HTML tags
  - Metatag plugin (urlmeta) to add custom metatag information
- integration of Nutch with analysis activities in order to execute the whole process
Future task:
- deployment and execution of Adamsoft/JSOUP and Nutch (HTTrack is abandoned due to its scalability problems) on the CINECA PICO platform (1,080 cores, 54 nodes, 6.9 TB RAM)
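For readers unfamiliar with the MapReduce model that Nutch relies on for parallelism, here is a single-process sketch of its three phases (map, shuffle, reduce) applied to counting terms per website. It only illustrates the programming model: it does not use the Hadoop or Nutch APIs, and the function names and sample pages are hypothetical.

```python
from collections import defaultdict


def map_page(site, text):
    """Map phase: emit ((site, term), 1) for every term on a page."""
    for word in text.lower().split():
        yield (site, word), 1


def shuffle(pairs):
    """Shuffle phase: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce phase: sum the values collected for one key."""
    return key, sum(values)


# Invented pages; on a cluster each page could be mapped on a different node.
pages = [
    ("site-a.example", "Order online"),
    ("site-a.example", "order status"),
    ("site-b.example", "Job openings"),
]
mapped = [pair for site, text in pages for pair in map_page(site, text)]
counts = dict(reduce_counts(k, v) for k, v in shuffle(mapped).items())
```

Because each page is mapped independently and each key is reduced independently, both phases can be spread across many nodes, which is what makes the model attractive when moving from tens of thousands to hundreds of thousands of websites.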
12 Conclusions
1. A first remark is that a scraping task can be carried out for different purposes in an Official Statistics production environment, and choosing a single tool for all purposes may not always be possible.
2. In this specific case, the final evaluation of the different solutions will depend on the results of their execution for massive scraping on an adequate platform (PICO).
3. Finally, we highlight that the scraping application presented here is a kind of generalized scraping task, as it does not require any specific assumption about the structure of the websites. In this sense it goes a step further than previous experiences.