
ESSnet BIG DATA
WorkPackage 2

URLs identification task: Istat current status

Giulio Barcaroli, Monica Scannapieco, Donato Summa

Istat developed and applied a procedure consisting of the following steps:

1. Identification of the target population of enterprises: for the current test, the population includes 188,000 enterprises with at least 10 employees in industry and services (Manufacturing; Electricity, gas and steam; Water supply, sewerage and waste management; Construction; Wholesale and retail trade, repair of motor vehicles and motorcycles; Transportation and storage; Accommodation and food service activities; Information and communication; Real estate activities; Professional, scientific and technical activities; Administrative and support activities; Repair of computers).

2. Access to already available URLs for 73,000 enterprises (from 3 different rounds of the ICT survey and from third-party data (Consodata)).

3. For each one of these 73,000 enterprises:
   I. use of the search engine Bing:
      a. input: denomination of the enterprise
      b. output: first 10 links
   II. for each link:
      A. access to the website and search for a number of items:
         a) telephone number
         b) VAT code
         c) municipality
         d) province abbreviation
         e) zip code
      B. computation of a score based on the above items (found: yes/no).

4. For each enterprise, the link with the highest score is selected, and the resulting set is used to fit a logistic model in order to estimate the probability of having found a correct link, based on the predictors (found: yes/no): telephone number, VAT code, municipality, province, zip code.
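The scoring idea in step 3.II.B can be sketched as follows. This is an illustrative Python sketch, not Istat's actual UrlScorer (a Java program, described in the Appendix), and the item weights below are hypothetical placeholders; the real weights are derived from per-item F-measures as explained later in the document.

```python
# Illustrative sketch of the link-scoring step: sum a weight for each
# identifying item (found: yes/no) detected on the candidate web page.
# The weights here are HYPOTHETICAL placeholders, not the real ones.

HYPOTHETICAL_WEIGHTS = {
    "telephone": 150,
    "vat_code": 400,
    "municipality": 100,
    "province": 150,
    "zip_code": 200,
}

def score_link(items_found: dict) -> int:
    """Sum the weights of the items detected on a page."""
    return sum(w for item, w in HYPOTHETICAL_WEIGHTS.items()
               if items_found.get(item))

# A page where only the VAT code and the zip code were found:
score_link({"vat_code": True, "zip_code": True})  # -> 600
```

The link with the highest such score is then kept as the single candidate URL for the enterprise.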

5. Application of the logistic model to the links found (as in step 3) for the remaining enterprises. Using thresholds that minimize false positives, we found:
   o 22,218 reliable links
   o 10,260 links to be checked (with a crowdsourcing platform).

With the above steps, we obtain a total of about 95,000 URLs without interactive controls, and about 105,000 URLs with interactive controls, out of an estimated total of 130,000.

Details
(For references to the processing modules, see the Appendices.)

Computation of the scores for each link

For each link scraped by UrlCrawler, the UrlScorer program generates a vector of numbers, something like this: 207101. Every position in the vector contains a number that usually means: a specific characteristic was found (or not) in the web page (e.g. presence of the telephone number).

For each element of the vector we calculated the confusion matrix that would have been obtained if we had used just that element for classification purposes. Based on that confusion matrix we computed standard performance measures:
   o precision
   o recall
   o F-measure (harmonic mean of precision and recall)

Then we assigned the F-measures of the corresponding confusion matrices as raw weights to the elements of the vector. Lastly, we normalized the raw weights so that the sum of the final (adjusted) weights is 1000. In order to assign a final score to a link, we summed the normalized weights of those elements that were actually present in the vector returned by UrlScorer.
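The weight derivation just described can be sketched as follows, assuming each item's confusion matrix is summarized as (TP, FP, FN, TN) counts. The counts below are invented for illustration; this is a Python sketch, not the actual Istat code.

```python
# Sketch of the F-measure-based weight derivation: one confusion matrix
# per item, raw weight = F-measure, then normalize so weights sum to 1000.
# The confusion-matrix counts below are HYPOTHETICAL examples.

def f_measure(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def normalized_weights(confusions: dict, total: float = 1000.0) -> dict:
    """Raw weights are per-item F-measures, rescaled to sum to `total`."""
    raw = {item: f_measure(tp, fp, fn)
           for item, (tp, fp, fn, tn) in confusions.items()}
    s = sum(raw.values())
    return {item: w / s * total for item, w in raw.items()}

# Hypothetical per-item confusion matrices (TP, FP, FN, TN):
confusions = {
    "telephone": (300, 100, 200, 400),
    "VAT": (450, 50, 50, 450),
}
weights = normalized_weights(confusions)  # sums to 1000 by construction
```

A link's final score is then the sum of the normalized weights of the items present in its score vector, which is why the score ranges from 0 to 1000.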

Logistic model

The script for preparing data and fitting the logistic model is LogisticModel.R (see Appendix 1 and Appendix 2). The set of links for the 73,000 enterprises was divided into equal parts, a training set and a test set. The fitted model was the following:

glm(formula = correct_yes_no ~ ., family = binomial(logit),
    data = train[, 2:ncol(train)])

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.3971  -0.7113   0.3415   0.6040   2.6650

Coefficients:
                Estimate Std. Error z value Pr(>|z|)
(Intercept)    -3.629861   0.055026 -65.966  < 2e-16 ***
telephone       0.107994   0.026948   4.007 6.14e-05 ***
simpleurl       0.797211   0.028717  27.761  < 2e-16 ***
link_position   0.298217   0.005246  56.846  < 2e-16 ***
VAT             1.997683   0.028593  69.865  < 2e-16 ***
municipality    0.262655   0.040320   6.514 7.30e-11 ***
province        0.487161   0.040835  11.930  < 2e-16 ***
zip_code       -0.002322   0.032067  -0.072    0.942
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 60679  on 43807  degrees of freedom
Residual deviance: 39806  on 43800  degrees of freedom
AIC: 39822

Number of Fisher Scoring iterations: 5

All the predictors are significant, with the exception of zip_code. Once the model was applied to the test set, the results were the following:

Logistic with 10 classes
Upper four classes taken as correct links

   score_class       true false classification_error group
1  [-3.52,-2.37]      487  3963           0.89056180     1
2  (-2.37,-1.62]      506  3853           0.88391833     2
3  (-1.62,-0.985]     698  3899           0.84816184     3
4  (-0.985,-0.389]   1168  3290           0.73799910     4
5  (-0.389,-0.0407]  2424  2667           0.52386565     5
6  (-0.0407,0.707]   2331  1153           0.33094145     6
7  (0.707,1.61]      3617   795           0.18019039     7
8  (1.61,2.06]       3711   514           0.12165680     8

9  (2.06,2.7]        4325   494           0.10251089     9
10 (2.7,2.81]        3575   339           0.08661216    10

Recall:     0.6666667
Precision:  0.8766839
F1 measure: 0.7573859

Lower five classes taken as incorrect links
Recall:     0.2312845
Precision:  0.2301459
F1 measure: 0.2307138

Percentage of enterprises to submit to crowdsourcing: 0.07952704
with expected positives: 0.1020489

In other words, if we set a take-positive threshold at 0.707, we take all records exceeding this value as positive. In this case, we cover 66.6% of all positives, with an error of 12.3%. If we set a take-negative threshold at -0.0407, we consider as negatives all records below this value, thus excluding 23.1% of positives. By considering the 6th class as the one to submit to crowdsourcing, we expect it to contain the remaining 10% of positives.
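The two-threshold decision rule just described can be sketched as follows. This is an illustrative Python sketch: the thresholds are the ones reported in the text and are on the logit (linear predictor) scale, so the inverse-logit shows the implied probability of a correct link.

```python
import math

# Thresholds reported in the text, on the logit scale:
TAKE_POSITIVE = 0.707    # at or above: accept the link as correct
TAKE_NEGATIVE = -0.0407  # at or below: reject the link

def classify(logit_score: float) -> str:
    """Partition links into accepted, rejected, or sent to crowdsourcing."""
    if logit_score >= TAKE_POSITIVE:
        return "accept"
    if logit_score <= TAKE_NEGATIVE:
        return "reject"
    return "crowdsource"

def probability(logit_score: float) -> float:
    """Inverse-logit: P(correct link) implied by the linear predictor."""
    return 1.0 / (1.0 + math.exp(-logit_score))

classify(2.1)        # -> "accept"
classify(0.3)        # -> "crowdsource"
probability(0.707)   # -> approx 0.67
```

Records in the middle band (class 6) are exactly the ones routed to the crowdsourcing platform.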

Appendix 1
PROCEDURE FOR LINKS RETRIEVAL AND SCORE ASSIGNMENT

The procedure contains several steps:

Step number 1
A custom Java program (UrlSearcher) takes in input 2 files containing the list of the firm names and the corresponding list of firm IDs. For each enterprise the program queries a search engine (we are currently using Bing) and retrieves the list of the first 10 URLs provided by the search engine; these URLs are then stored in a file, so we will have one file for each firm. At the end, the program reads each produced file and creates the seed file, a txt file in which every row is composed as follows:

url_retrieved + TAB + firm_id + TAB + position_of_the_url

e.g.: www.mywebsite.com/aboutus 111 3

Step number 2
A custom Java program (UrlCrawler) takes in input 3 files:
1) the seed file just produced
2) a list of URL domains to avoid (usually directory domains)
3) a configuration file
and for each row of the seed file (if the URL is not in the list of the domains to avoid) tries to acquire the HTML page. From each acquired HTML page the program selects just the textual content of the fields we are interested in and writes a line in a CSV file. The structure of each row is this:

id + TAB + url + TAB + imgsrc + TAB + imgalt + TAB + links + TAB + ahref + TAB + aalt + TAB + inputvalue + TAB + inputname + TAB + metatagdescription + TAB + metatagkeywords + TAB + firmid + TAB + sitoazienda + TAB + link_position + TAB + title + TAB + text_of_the_pagebody

Step number 3

The CSV file is imported into the storage platform (we are currently using Solr).

Step number 4
A custom Java program (UrlScorer) takes in input:
1) a CSV file containing several rows; each row contains several pieces of information about a specific firm (ID, name, VAT code, ZIP code, telephone number, etc.)
2) a reference to the storage platform
For each row of the file (that is, for each firm) the program extracts from the storage platform all the documents previously stored for that firm. For each document the program calculates a score vector and assigns a score on the basis of predefined conditions (e.g.: whether the VAT code or the telephone number can be found in the text of the document) and writes a row in a CSV file with the following structure:

firmid + TAB + linkposition + TAB + url + TAB + scorevector + TAB + score

The score vector is a vector with the following structure: 2111111

From left to right, the meaning of each number is the following:
1) telephone number found in the text (2 found, 1 not found)
2) URL in a simple form like www.firmname.com (1 yes, 0 no)
3) link position (9 1st position, 8 2nd position, 7 3rd position, ..., 0 10th position)
4) VAT code found in the text (1 yes, 0 no)
5) municipality name found in the text (1 yes, 0 no)
6) province abbreviation found in the text (1 yes, 0 no)
7) zip code found in the text (1 yes, 0 no)

The score is an integer number from 0 to 1000.

Step number 5
The CSV file obtained in step 4 is the input for the machine learning program that will try to predict the correct URL for each firm. In order to accomplish this task, the learner must first be instructed with a training set (a similar CSV file containing the extra boolean column "is the found URL correct"). We created this training set using a custom Java program (UrlMatchTableGenerator) that merges the CSV file from step 4 with a list of correct sites.
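The 7-digit score vector described in Step number 4 can be decoded into named fields as in this sketch (Python for illustration; the field names follow the predictors of the logistic model):

```python
# Sketch decoding a 7-digit UrlScorer score vector into named fields,
# following the position meanings listed in Step number 4.

FIELDS = ["telephone", "simpleurl", "link_position",
          "VAT", "municipality", "province", "zip_code"]

def decode_score_vector(vector: str) -> dict:
    """Map each digit of the score vector to its named field."""
    if len(vector) != len(FIELDS):
        raise ValueError("score vector must have exactly 7 digits")
    return dict(zip(FIELDS, (int(c) for c in vector)))

decode_score_vector("2111111")
# -> {'telephone': 2, 'simpleurl': 1, 'link_position': 1, 'VAT': 1,
#     'municipality': 1, 'province': 1, 'zip_code': 1}
```

This is essentially what the substring() calls in LogisticModel.R (Appendix 2) do when preparing the model matrix.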

Appendix 2
PROCEDURE FOR LOGISTIC MODEL FITTING AND PREDICTION

Script LogisticModel.R to fit the model

######################################################
# Script to fit a logistic model to links
# resulting from the use of the search engine
# having in input the denomination of the enterprises
######################################################
# Read data
url <- read.csv("punteggipergiulio_18_05_16.csv")
url <- url[,c("cod_azienda","score_vector","score","match_domain_no_ext")]
# Consider links for each enterprise
a <- split(url,url$cod_azienda)
# Consider the link with maximum score for each enterprise
url1 <- NULL
for (i in 1:length(a)) {
  url1 <- rbind(url1,a[[i]][a[[i]]$score == max(a[[i]]$score),][1,])
  cat("\n",i)
}
save(url1,file="urls_match_domain_no_ext.rdata")
url <- url1
rm(url1)
# Prepare data for logistic model fitting
url$score_vector <- as.character(url$score_vector)
url2 <- NULL
url2$ent_code <- url$cod_azienda
url2$correct_yes_no <- url$match_domain_no_ext
table(url2$correct_yes_no)
url2$telephone <- substring(url$score_vector,1,1)
table(url2$telephone)
url2$simpleurl <- substring(url$score_vector,2,2)
table(url2$simpleurl)
url2$link_position <- substring(url$score_vector,3,3)
table(url2$link_position)
url2$VAT <- substring(url$score_vector,4,4)
table(url2$VAT)
url2$municipality <- substring(url$score_vector,5,5)
table(url2$municipality)
url2$province <- substring(url$score_vector,6,6)
table(url2$province)
url2$zip_code <- substring(url$score_vector,7,7)
table(url2$zip_code)
url2 <- as.data.frame(url2)
write.table(url2,"accoppiamenti_domain_no_ext.txt",sep="\t",quote=F,row.names=F,col.names=T)
urls <- read.delim("accoppiamenti_domain_no_ext.txt")
# Prepare training and test set (50-50)
set.seed(1234)
v <- sample(c(1:nrow(urls)),round(nrow(urls)/2))

train <- urls[v,]
test <- urls[-v,]
#train <- urls[1:round(nrow(urls)/2),]
#test <- urls[(round(nrow(urls)/2)+1):(nrow(urls)),]
table(train$correct_yes_no)/sum(table(train$correct_yes_no))
table(test$correct_yes_no)/sum(table(test$correct_yes_no))
t1 <- table(train$correct_yes_no)/nrow(train)
t1
t2 <- table(test$correct_yes_no)/nrow(test)
t2
quant <- round(t1[1],2)
quant
observtrain <- train$correct_yes_no
observtest <- test$correct_yes_no
# ----------------------------------------------------------------
# Logistic model
logistic <- glm(correct_yes_no ~ ., family = binomial(logit),
                data = train[,2:ncol(train)])
summary(logistic)
predictedtrain <- predict(logistic, train[,2:ncol(train)])
hist(predictedtrain)
t <- quantile(predictedtrain,quant)
predictedtrain <- ifelse(predictedtrain > t,1,0)
tavtrain <- table(observtrain,predictedtrain)
tavtrain/nrow(train)
predicted <- predict(logistic, test[,2:ncol(test)])
summary(predicted)
hist(predicted)
t <- quantile(predicted,quant)
predictedtest <- ifelse(predicted > t,1,0)
tavtest <- table(observtest,predictedtest)
tavtest/sum(tavtest)
ordinati <- cbind(test,predicted)
aggr <- ordinati[,c("ent_code","predicted","correct_yes_no")]
colnames(aggr) <- c("ident","logistico","corretto")
aggr$gruppo <- cut(aggr$logistico,
                   breaks=quantile(aggr$logistico,probs=seq(0,1,0.1)),
                   include.lowest=TRUE)
aggr$corretti <- ifelse(aggr$corretto == 1,1,0)
aggr$errati <- ifelse(aggr$corretto == 1,0,1)
res <- aggregate(aggr[,c("corretti","errati")],by=list(aggr$gruppo),FUN=sum)
res$errore <- res$errati / (res$errati+res$corretti)
res$group <- c(1:nrow(res))
colnames(res) <- c("score_class","true","false","classification_error","group")
pdf("plot_logistic_match_domain_no_ext.pdf")
plot(res$group,res$true,type="l",col="red",ylim=c(0,max(res$true)))
lines(res$group,res$false,type="l",col="blue")
legend("topleft","blue=errors, red=correct")
title("True and false by class of score (logistic model)")
dev.off()

write.table(res,"res_logistico_match_domain_no_ext.txt",sep="\t",quote=F,row.names=F,col.names=T)
sink("logistic_match_domain_no_ext2.txt")
attach(res)
cat("\n Logistic with 10 classes")
cat("\n Upper four classes taken as correct links \n")
res
recall <- sum(true[7:10])/sum(true)
cat("\n Recall: ",recall)
precision <- sum(true[7:10])/(sum(true[7:10])+sum(false[7:10]))
cat("\n Precision: ",precision)
f1 <- 2*(precision*recall)/(precision+recall)
cat("\n F1 measure: ",f1)
cat("\n Lower five classes taken as incorrect links \n")
recall <- sum(true[1:5])/sum(true)
cat("\n Recall: ",recall)
precision <- sum(true[1:5])/(sum(true[1:5])+sum(false[1:5]))
cat("\n Precision: ",precision)
f1 <- 2*(precision*recall)/(precision+recall)
cat("\n F1 measure: ",f1)
perc <- (sum(true[6])+sum(false[6]))/(sum(true[1:10])+sum(false[1:10]))
pos <- sum(true[6])/sum(true[1:10])
cat("\n Percentage of enterprises to submit to crowdsourcing: ", perc)
cat("\n with expected positives: ", pos)
sink()
save.image("urls_19_05_16.rdata")
detach(res)
#------------------------------------------------------------------------------
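The decile-based evaluation at the end of the script above can be mirrored by this compact sketch (Python with synthetic scores and labels, for illustration only; the real computation is the R code in the script):

```python
# Illustrative companion to the evaluation section of the R script:
# bin logit scores into deciles, count correct/incorrect links per bin,
# then compute recall/precision taking the top four bins as positives.
# Scores and labels passed in are synthetic, for demonstration only.

def decile_table(scores, labels):
    """Per-decile [true, false] counts, bins ordered low to high score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n = len(order)
    bins = [[0, 0] for _ in range(10)]
    for rank, i in enumerate(order):
        b = min(rank * 10 // n, 9)
        bins[b][0 if labels[i] == 1 else 1] += 1
    return bins

def upper_classes_metrics(bins, k=4):
    """Recall/precision when the top k deciles are predicted correct."""
    tp = sum(t for t, f in bins[-k:])
    fp = sum(f for t, f in bins[-k:])
    total_true = sum(t for t, f in bins)
    return tp / total_true, tp / (tp + fp)
```

With perfectly separating synthetic scores, the top four deciles contain only true links, so precision is 1 and recall equals the share of true links falling in those deciles.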

Script LogisticModel.R to apply the model

######################################################
# Script to apply the logistic model to links
# resulting from the use of the search engine
# having in input the denomination of the enterprises
######################################################
# Read data
url <- read.delim("punteggitestset.csv")
# Consider links for each enterprise
a <- split(url,url$firm_id)
# Consider the link with maximum score for each enterprise
url1 <- NULL
for (i in 1:length(a)) {
  url1 <- rbind(url1,a[[i]][a[[i]]$score == max(a[[i]]$score),][1,])
  cat("\n",i)
}
url <- url1
# Prepare data for logistic model prediction
url$score_vector <- as.character(url$score_vector)
urls <- NULL
urls$firm_id <- url$firm_id
urls$link_position <- url$link_position
urls$telephone <- as.numeric(substring(url$score_vector,1,1))
table(urls$telephone)
urls$simpleurl <- as.numeric(substring(url$score_vector,2,2))
table(urls$simpleurl)
urls$link_position <- as.numeric(substring(url$score_vector,3,3))
table(urls$link_position)
urls$VAT <- as.numeric(substring(url$score_vector,4,4))
table(urls$VAT)
urls$municipality <- as.numeric(substring(url$score_vector,5,5))
table(urls$municipality)
urls$province <- as.numeric(substring(url$score_vector,6,6))
table(urls$province)
urls$zip_code <- as.numeric(substring(url$score_vector,7,7))
table(urls$zip_code)
urls <- as.data.frame(urls)
load("logisticmodel.rdata")
urls$predictedscore <- predict(logistic, urls[,2:ncol(urls)])
urls$link <- ifelse(urls$predictedscore >= 0.707,1,0)
urls$nonlink <- ifelse(urls$predictedscore <= -0.0407,1,0)
urls$crowd <- ifelse(urls$predictedscore > -0.0407 & urls$predictedscore < 0.707,1,0)
table(urls$link)
table(urls$nonlink)
table(urls$crowd)
write.table(urls,"links_scored.txt",sep="\t",quote=F,row.names=F,col.names=T)