Methodological challenges of Big Data for official statistics


Piet Daas, Statistics Netherlands

THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION

Content
- Big Data: properties
- Big Data: strengths and weaknesses
- Ways to include Big Data in official statistics
- Important methodological issues
- And other relevant issues
- Examples
- Concluding remarks

Big Data: properties
These highly affect the methodological issues:
- Kind of data
- Characteristics
- Types of data

Big Data: kind of data
A national statistical institute (NSI) works with:
- Primary data: our own questionnaires
- Secondary data: data from others
  - Administrative sources
  - Big Data
Many Big Data sources are predominantly composed of events.

Big Data: characteristics
There are more Vs.

Big Data: types of data
In principle, 3 types of Big Data can be discerned:
1. Social network data (human sourced)
   - Facebook, Twitter, blogs, text messages, etc.
2. Traditional system data (process mediated)
   - Bank transaction data, credit cards, medical data
3. Internet of things data (machine generated)
   - Sensor data, GPS data, satellite pictures, etc.
However, this is still work in progress.
Important message: don't think all Big Data sources are alike (do not generalize a priori).

Big Data: strengths and weaknesses (1)
Pros
- Quickly available (near real time)
- Lots and lots of data
- High-frequency measurements
- Usually includes many units

Big Data: strengths and weaknesses (2)
Cons
- Often composed of events (caused by units)
- Even those units may not be similar to statistical units
- (Sometimes) indirect measurements of concepts
- Volatile and noisy (signal/noise ratio)
- Can be selective (what part of the target population is included?)
- Stability of source and data (effect of maintainers and users)
- Metadata is not always well described or accurate
More in: "The Parable of Google Flu: Traps in Big Data Analysis"

How can Big Data be included in official statistics?
1. As the only source (replacement/new statistics)
   - Traffic intensity statistics (NL) and Billion Prices project (MIT)
2. As the main source, with survey/admin. data as benchmark
   - Google Trends-like approaches; (regular) benchmarking needed
3. As an additional source for statistics based on survey/admin. data
   - For example, to enable small area estimation
4. As supplier of missing data
   - For example, use data on level of education from the internet to fill gaps in the education register
   - But also for nowcasting and to improve timeliness!
5. Don't use it

Google flu prediction
We can learn from this! Models, correlations and changing realities.

Most important methodological issues
From the above it follows that the essential issues are:
1. Transform events to units (the population frame)
2. Combine Big Data findings with traditional data
   - At the unit level or by correlating, etc.
3. Measure selectivity and correct for it
   - Also related to the stability of the data
4. Determine and improve quality & reduce noise
Is this all?

Statistics is all about populations
But:
- Not every Big Data source provides information on the units in the data source
- Background characteristics (auxiliary variables) are often absent
- For example: what are the background characteristics of social media users?
Take a moment to look this up online!

Most important methodological issues
From the above it follows that the essential issues are:
1. Transform events to units (the population frame)
2. Combine Big Data findings with traditional data
   - At the unit level or by correlating, etc.
3. Measure selectivity and correct for it
   - Also related to the stability of the data
4. Determine and improve quality & reduce noise
5. Find ways to obtain/derive background characteristics of units in Big Data sources (when needed)

Examples of current methodological state of the art
1. Transforming events into units
Road sensor data
- Sensors measure the number of vehicles passing on a specific lane of a specific road, in a specific direction, in a specific minute
- What is the population frame of traffic intensity statistics? The roads!
- Therefore one needs to link the sensors to the roads, so that the number of vehicles per km of road can be calculated

Road sensor data example
- There are 20,000 sensors active on Dutch highways
- They count the number of passing vehicles every minute
- They are used to produce road intensity statistics
- Output: vehicle kilometres (= #vehicles * distance travelled) per highway per COROP area

Data per road sensor

Correction method for road sensors

                   Statistics on persons                  Road intensity
Frame              Target population (unit = person)      Roads (unit = km)
Sample             Sample of persons                      All road sensors
Data collection    Questionnaire                          Road sensor data
Weights            Based on (demographic) background      Based on road segment length
                   characteristics

[Figure: road sensors along the main route, which is divided into road segments; the road segments act as weights]
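The weighting idea above, with road segment length playing the role that demographic weights play in a person survey, can be sketched in a few lines. This is only an illustration under stated assumptions: the counts and lengths are invented, and averaging the active sensors within one segment is an assumption of this sketch, not something the slides specify.

```python
def segment_count(sensor_counts):
    """Mean vehicle count over the active sensors of one road segment.

    None marks a sensor that delivered no data (assumed convention).
    """
    active = [c for c in sensor_counts if c is not None]
    if not active:
        raise ValueError("no active sensor on this segment")
    return sum(active) / len(active)


def vehicle_km(segments):
    """Sum over segments of (vehicle count) x (segment length in km)."""
    return sum(segment_count(counts) * length_km for counts, length_km in segments)


# Hypothetical data: (counts per sensor, segment length in km).
segments = [
    ([1200, 1180], 2.0),  # two sensors on a 2.0 km segment
    ([1100, None], 3.5),  # one of two sensors is missing
    ([900], 1.5),
]
total = vehicle_km(segments)
print(total)
```

A segment without any active sensor raises an error here on purpose, because silently skipping it would lower the estimate.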

Correction
[Figure: traffic flow over the road segments, with number of vehicles and segment length per segment]
Sum over segments of (number of vehicles * segment length): ... * 2500 = 107,500 vehicle-km

Examples of current methodological state of the art
2. Combine Big Data findings with traditional data
A. Compare series of development (macro level)
B. Link data sources (micro level)
Examples of A:
- Consumer confidence and social media sentiment (monthly)
- GDP and traffic intensities (quarterly)
Both Big Data sources measure a related phenomenon (correlation 0.92/0.91).
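A macro-level comparison of two series (approach A) comes down to a correlation, optionally with one series shifted when it runs ahead of the other. The sketch below shows that calculation; the function names and the toy series are hypothetical and are not the GDP or traffic data.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def lagged_correlation(leading, following, lag):
    """Correlate leading[t] with following[t + lag], for lag >= 0."""
    if lag < 0:
        raise ValueError("lag must be non-negative")
    if lag == 0:
        return pearson(leading, following)
    return pearson(leading[:-lag], following[lag:])


# Toy series: 'following' repeats 'leading' one period later.
leading = [1, 3, 2, 5, 4, 6, 5, 8]
following = [0, 1, 3, 2, 5, 4, 6, 5]
r0 = lagged_correlation(leading, following, 0)
r1 = lagged_correlation(leading, following, 1)
print(round(r0, 2), round(r1, 2))
```

Scanning a few lags and keeping the one with the highest correlation is a simple way to see which series leads.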

Correlation example: Dutch GDP and Dutch traffic
[Figure: time series of GDP and traffic]
GDP vs traffic
- A 3% increase in GDP corresponds to a 12% increase in traffic
- Traffic is ahead of GDP by 1 quarter
- Correlation: 82% from 2010-Q3 till 2014-Q4, 91% from 2011-Q2 till 2014-Q4

Examples of current methodological state of the art
3. Measure selectivity and correct for it
Need background characteristics (see 5. when absent)
1. In the case of the road sensors, not all roads were found to be covered by a sufficient number of sensors
   - One road segment had a maximum of 4 sensors, but the number of active sensors was found to vary over time
   - Another road had 8 sensors, with 7 on the right lane and 1 on the left lane
   Poor coverage and occasionally malfunctioning sensors (missing data) cause bad estimates when not corrected for!
2. Apply model-based & algorithmic ways of inference
   - Use (advanced) model-based approaches (Statistics meets Data Science)

1. Road sensors: poor coverage
- The number of active sensors may vary over time
[Figure: active sensors at times t1 and t2]
- Poor coverage reduces estimates of the number of vehicles

2. Algorithmic inference (1)
Does the soup taste good?
- Non-probability sample (Big Data)
- Unknown inclusion probabilities; the result is precise but biased. How to do inference?

2. Algorithmic inference (2)
Compared results of pseudo-design-based, model-based and algorithmic inference methods on a generated non-probability sample of vehicle odometer values:
- Sample mean (SAM)
- Pseudo-design based (PDB)
- Generalized Linear Model (GLM)
- k-nearest Neighbours (KNN)
- Artificial Neural Network (ANN)
- Regression Tree (RTR)
- Support Vector Machine (SVM)
See Buelens et al.

2. Algorithmic inference (3)
- Sample mean (SAM): average of all units observed
  $\hat{y}_j = \bar{y} = \frac{1}{n} \sum_{i \in S} y_i$
- Pseudo-design based (PDB): average of units observed in each stratum
  $\hat{y}_j = \bar{y}_h = \frac{1}{n_h} \sum_{i \in h \cap S} y_i, \quad j \in h$
- Generalized Linear Model (GLM): linearized combination of auxiliary variables
  $g(E[Y_i]) = \beta x_i, \quad \hat{y}_j = g^{-1}(\beta x_j)$
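The difference between the sample mean and the pseudo-design-based estimator shows up most clearly on a selective sample. The sketch below assumes that the population size of each stratum is known and turns the stratum means into a population mean by weighting with those sizes; the strata, sizes and odometer values are invented for illustration.

```python
from collections import defaultdict


def sample_mean(sample):
    """SAM: average of all observed units, ignoring strata."""
    return sum(y for _, y in sample) / len(sample)


def pseudo_design_based(sample, stratum_sizes):
    """PDB: predict each stratum by its observed mean, then weight the
    stratum means by the (assumed known) population stratum sizes."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for stratum, y in sample:
        totals[stratum] += y
        counts[stratum] += 1
    population = sum(stratum_sizes.values())
    estimate = 0.0
    for stratum, size in stratum_sizes.items():
        if counts[stratum] == 0:
            raise ValueError(f"no observed units in stratum {stratum!r}")
        estimate += size / population * (totals[stratum] / counts[stratum])
    return estimate


# Hypothetical selective sample: new cars are over-represented.
sample = [("new", 20000), ("new", 22000), ("new", 18000), ("old", 10000)]
stratum_sizes = {"new": 50, "old": 50}  # hypothetical population counts

sam = sample_mean(sample)
pdb = pseudo_design_based(sample, stratum_sizes)
print(sam, pdb)
```

The sample mean is pulled towards the over-represented stratum, while the stratum-weighted estimate is not, provided the strata actually explain the selectivity.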

2. Algorithmic inference (4)
- k-nearest Neighbours (KNN): average of the k closely associated observed units in space X
  $\hat{y}_j = \frac{1}{k} \sum_{a \in A_j} y_a$
- Artificial Neural Network (ANN): results of a trained network of artificial neurons
  $\hat{y}_j = \mathrm{ANN}(x_j, \beta)$
[Figures: artificial neuron; network of artificial neurons (source: Wikipedia)]

2. Algorithmic inference (5)
Regression Tree (RTR)
- Construct a binary tree
  - Maximize between variance
  - With a stop criterion
[Figure: tree with splits $X_1 \le b$ / $X_1 > b$, $X_1 \le a$ / $X_1 > a$, $X_2 \le c$ / $X_2 > c$]
- Average of observed units in each leaf
  $\hat{y}_j = \mathrm{RTR}(x_j, \beta) = \frac{1}{n_\lambda} \sum_{k \in \lambda \cap S} y_k$
- Algorithmic version of PDB
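The k-nearest-neighbours estimator is the easiest of these to write down. A minimal sketch with a single auxiliary variable and absolute distance follows; both are simplifying assumptions, and the data are invented.

```python
def knn_predict(x_new, observed, k):
    """Mean of y over the k observed units whose x is closest to x_new.

    observed is a list of (x, y) pairs; distance is |x - x_new|.
    """
    if not 1 <= k <= len(observed):
        raise ValueError("k must be between 1 and the number of observed units")
    nearest = sorted(observed, key=lambda unit: abs(unit[0] - x_new))[:k]
    return sum(y for _, y in nearest) / k


# Hypothetical observed units: (vehicle age in years, km driven per year).
observed = [(1, 30000), (2, 27000), (3, 25000), (8, 12000), (9, 10000)]
prediction = knn_predict(2.4, observed, k=2)
print(prediction)
```

With several auxiliary variables the distance would have to be defined over the whole vector, and the variables scaled first.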

2. Algorithmic inference (6)
Support Vector Machine (SVM)
- It represents data as points in space, mapped so that they are divided by a clear gap that is made as wide as possible
  $\hat{y}_j = \sum_i \alpha_i K(x_i, x_j)$

2. Algorithmic inference (7)
Model optimisation
- Split the sample into a training set (70%) and a test set (30%)
- Train the model with the training set data
- Use the trained model to predict the test set
- Compare predictions with the observed test data
- Choose the parameter with the smallest MSE
Estimation
- Train the model with the whole sample
- Predict for the rest of the data set (target population = sample + rest)
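The model optimisation steps can be sketched with k-nearest neighbours as the model and k as the parameter to tune. Everything named here (`choose_k`, the candidate values, the toy sample, the fixed random seed) is a hypothetical choice for illustration; only the 70/30 split and the smallest-MSE rule come from the slide.

```python
import random


def knn_predict(x_new, observed, k):
    """Mean of y over the k observed (x, y) units closest to x_new."""
    nearest = sorted(observed, key=lambda unit: abs(unit[0] - x_new))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def mse(train, test, k):
    """Mean squared error of k-NN predictions on the test set."""
    errors = [(knn_predict(x, train, k) - y) ** 2 for x, y in test]
    return sum(errors) / len(errors)


def choose_k(sample, candidates, train_fraction=0.7, seed=0):
    """Split the sample, score every candidate k, return the best one."""
    shuffled = sample[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    train, test = shuffled[:cut], shuffled[cut:]
    scores = {k: mse(train, test, k) for k in candidates}
    best = min(scores, key=scores.get)
    return best, scores


# Hypothetical sample: (vehicle age, km driven per year).
sample = [(age, 30000 - 2000 * age) for age in range(20)]
best_k, scores = choose_k(sample, candidates=[1, 3, 5])
print(best_k, scores)
```

For the estimation step, the model would then be refitted on the whole sample with `best_k` and used to predict the units of the target population that are not in the sample.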

2. Algorithmic inference (8)
Illustration of findings
- Non-probability sample of odometer values of cars, used to estimate km driven per registration year
- Extrapolation
- Variance by bootstrapping

Examples of current methodological state of the art
4. Determine and improve quality & reduce noise
- Determining the quality of massive amounts of data is hard
  - Event oriented or unit oriented?
- Improve quality by removing noise/increasing signal
  - Remove the non-relevant part of the population (persons/companies), impute missing data, aggregate data over a longer period, etc. (combine approaches)
  - Road sensors often miss data for various minutes; imputing these values and applying a filter to smooth the data improves quality
  - Aggregating social media sentiment data reduces noise, as does applying a (Kalman) filter; both improve the correlation with consumer confidence

Measuring quality
Event-based quality indicators (sensor data)
- Should be rapidly determined (huge amounts of data)
- For example:
  - number of measurements per day for each sensor (L)
  - number of blocks of missing data per day for each sensor (B)
  - mean of number of vehicles detected per day for each sensor (M)
  - number of zero measurements per day for each sensor (O)

Improving quality
- Imputing missing values and smoothing (Bayesian filter)
- The filter does not introduce extra errors:
  - Precision: 3.6%
  - Accuracy: +0.13%
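The four indicators are cheap to compute in a single pass over one sensor's minute data for one day. The sketch below assumes that a missing minute is stored as `None` and reads M as the mean count per observed minute; both are assumptions of this example, as is the function name.

```python
def quality_indicators(minute_counts):
    """Return the indicators L, B, M and O for one sensor and one day.

    minute_counts holds one vehicle count per minute; None means missing.
    """
    observed = [c for c in minute_counts if c is not None]
    blocks = 0
    previous_missing = False
    for count in minute_counts:
        missing = count is None
        if missing and not previous_missing:
            blocks += 1  # a new run of consecutive missing minutes starts
        previous_missing = missing
    return {
        "L": len(observed),                                      # measurements
        "B": blocks,                                             # blocks of missing data
        "M": sum(observed) / len(observed) if observed else None,  # mean count
        "O": sum(1 for c in observed if c == 0),                 # zero measurements
    }


# Hypothetical eight-minute excerpt instead of a full day of 1440 minutes.
day = [12, 15, None, None, 0, 9, None, 14]
indicators = quality_indicators(day)
print(indicators)
```

Because every indicator is a simple count or mean, the same function can be mapped over all sensors in parallel.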

Examples of current methodological state of the art
5. Find ways to obtain/derive background characteristics of units in Big Data sources
- Make use of the massive amount of data to find clues indicative of important background characteristics of units
- Use AI/machine learning approaches
- Example: determine the gender of social media users
  - Studied a sample of 1000 Twitter user accounts in the Netherlands

Background characteristics of units
An example: Dutch Twitter users
- Only a part of the Dutch population is active on Twitter. But which part?
- Determine background characteristics, such as gender, age, income, level of education, etc.
- What are the possibilities? Feature extraction is the way to go
- Let's look at gender

[Figure: a Twitter profile with four sources of information: 1) name, 2) short bio, 3) message content, 4) picture]

Studied a Twitter sample
From a list of Dutch Twitter users (~ ) a random sample of 1000 unique ids was drawn.
Of the sample:
- 844 profiles still existed
- 844 had a name
- 583 provided a short bio
- 473 created tweets
- 804 had a non-default picture
[Figure: default Twitter picture]
Sample composition:
- 409 men (49%)
- 282 women (33%)
- 153 others (18%): companies, organizations, dogs, cats, bots

Gender findings: 1) First name
- Used the Dutch Voornamenbank website (first name database)
- Score between 0 and 1 (female to male); 676 of 844 (80%) names were registered
- Unknown names scored -1 (usually companies/organizations)

Gender findings: 2) Short bio
- Only if a short bio is provided
- Some people mention their position in the family
  - Mother, father, papa, mama, son of, etc.
- 155 of 583 (27%) indicated their gender in the short bio (especially women!)
- Need to check both English and Dutch texts

Gender findings: 3) Tweet content
- In cooperation with the University of Twente (Dong Nguyen)
- Machine learning approach that determines a gender-specific writing style score
- Language specific: messages need to be Dutch!
- 437 of 473 (92%) persons that created tweets could be classified

Gender findings: 4) Profile picture
- Use OpenCV to process pictures
  1) Face recognition
  2) Standardisation of faces (resize & rotate)
  3) Classify faces according to gender
- 75% of the 804 profile pictures had 1 or more faces on them

Gender findings: overall results

Approach           Diagnostic Odds Ratio (log)
First name         4.33
Short bio
Tweet content
Picture (faces)    0.57

Diagnostic Odds Ratio = (TP/FN) / (FP/TN); for random guessing, log(DOR) = 0
Multi-agent findings
- Need clever ways to combine these
- Take the processing efficiency of each agent into consideration

Gender findings: combining approaches
Combine findings in the best possible way:

Unassigned (%)    Approach used
844 (100%)        1. Use short bio scores (very precise for females)
689 (82%)         2. Use first name scores
153 (18%)         3. Use tweet content
29 (3.4%)         4. Use picture
20 (2.4%)         5. Assign male gender

Final log(DOR) is 7.02, an accuracy of 96.5%!
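The two ideas on these slides, scoring a classifier by its diagnostic odds ratio and applying the approaches one after another until one of them gives an answer, can be sketched as below. The two toy classifiers and the tiny name lookup are hypothetical stand-ins for the real agents, and the natural logarithm is an assumption, since the slides do not state the base.

```python
from math import log


def log_dor(tp, fn, fp, tn):
    """Natural log of the diagnostic odds ratio (TP/FN) / (FP/TN)."""
    return log((tp / fn) / (fp / tn))


def classify(profile, classifiers, default="male"):
    """Try each classifier in order; the first non-None answer wins."""
    for classifier in classifiers:
        label = classifier(profile)
        if label is not None:
            return label
    return default  # last step of the cascade: assign a default label


def by_bio(profile):
    """Toy short-bio rule: look for family roles."""
    words = profile.get("bio", "").lower().split()
    if "mother" in words or "mama" in words:
        return "female"
    if "father" in words or "papa" in words:
        return "male"
    return None


TOY_FIRST_NAMES = {"anna": "female", "jan": "male"}  # hypothetical lookup


def by_first_name(profile):
    """Toy first-name rule based on a small lookup table."""
    parts = profile.get("name", "").lower().split()
    return TOY_FIRST_NAMES.get(parts[0]) if parts else None


cascade = [by_bio, by_first_name]
print(classify({"name": "Jan Jansen", "bio": "Proud mother of two"}, cascade))
print(log_dor(tp=50, fn=50, fp=50, tn=50))
```

Ordering the cascade from the most precise to the least precise agent means that a cheap, accurate rule settles most cases before slower ones, such as picture analysis, are needed.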

Concluding remarks
- Big Data has great potential for official statistics
- There are many challenges
- Using Big Data is not like using survey or admin. data
- There are new methodological challenges
  - Such as extracting features and extracting information from texts and images/videos
- Learn from others by looking at scientific areas such as Artificial Intelligence/Machine learning

Questions?

Big Data Methods. Chapter 5: Machine learning. Big Data Methods, Chapter 5, Slide 1

Big Data Methods. Chapter 5: Machine learning. Big Data Methods, Chapter 5, Slide 1 Big Data Methods Chapter 5: Machine learning Big Data Methods, Chapter 5, Slide 1 5.1 Introduction to machine learning What is machine learning? Concerned with the study and development of algorithms that

More information

D B M G Data Base and Data Mining Group of Politecnico di Torino

D B M G Data Base and Data Mining Group of Politecnico di Torino DataBase and Data Mining Group of Data mining fundamentals Data Base and Data Mining Group of Data analysis Most companies own huge databases containing operational data textual documents experiment results

More information

Uses of web scraping for official statistics

Uses of web scraping for official statistics Uses of web scraping for official statistics ESTP course on Big Data Sources Web, Social Media and Text Analytics, Day 1 Olav ten Bosch, Statistics Netherlands THE CONTRACTOR IS ACTING UNDER A FRAMEWORK

More information

Exploring Econometric Model Selection Using Sensitivity Analysis

Exploring Econometric Model Selection Using Sensitivity Analysis Exploring Econometric Model Selection Using Sensitivity Analysis William Becker Paolo Paruolo Andrea Saltelli Nice, 2 nd July 2013 Outline What is the problem we are addressing? Past approaches Hoover

More information

An introduction to web scraping, IT and Legal aspects

An introduction to web scraping, IT and Legal aspects An introduction to web scraping, IT and Legal aspects ESTP course on Automated collection of online proces: sources, tools and methodological aspects Olav ten Bosch, Statistics Netherlands THE CONTRACTOR

More information

MIT 801. Machine Learning I. [Presented by Anna Bosman] 16 February 2018

MIT 801. Machine Learning I. [Presented by Anna Bosman] 16 February 2018 MIT 801 [Presented by Anna Bosman] 16 February 2018 Machine Learning What is machine learning? Artificial Intelligence? Yes as we know it. What is intelligence? The ability to acquire and apply knowledge

More information

Fall 09, Homework 5

Fall 09, Homework 5 5-38 Fall 09, Homework 5 Due: Wednesday, November 8th, beginning of the class You can work in a group of up to two people. This group does not need to be the same group as for the other homeworks. You

More information

CS423: Data Mining. Introduction. Jakramate Bootkrajang. Department of Computer Science Chiang Mai University

CS423: Data Mining. Introduction. Jakramate Bootkrajang. Department of Computer Science Chiang Mai University CS423: Data Mining Introduction Jakramate Bootkrajang Department of Computer Science Chiang Mai University Jakramate Bootkrajang CS423: Data Mining 1 / 29 Quote of the day Never memorize something that

More information

Missing Data and Imputation

Missing Data and Imputation Missing Data and Imputation NINA ORWITZ OCTOBER 30 TH, 2017 Outline Types of missing data Simple methods for dealing with missing data Single and multiple imputation R example Missing data is a complex

More information

Applying Supervised Learning

Applying Supervised Learning Applying Supervised Learning When to Consider Supervised Learning A supervised learning algorithm takes a known set of input data (the training set) and known responses to the data (output), and trains

More information

Network Traffic Measurements and Analysis

Network Traffic Measurements and Analysis DEIB - Politecnico di Milano Fall, 2017 Sources Hastie, Tibshirani, Friedman: The Elements of Statistical Learning James, Witten, Hastie, Tibshirani: An Introduction to Statistical Learning Andrew Ng:

More information

Distracted Driving on the Capital Beltway

Distracted Driving on the Capital Beltway Distracted Driving on the Capital Beltway Contents Introduction 3 Key Findings 4 Methodology 14 2 Distracted Driving on the Capital Beltway Approximately, 210,000 vehicles travel the Capital Beltway (I-495)

More information

Using Machine Learning to Optimize Storage Systems

Using Machine Learning to Optimize Storage Systems Using Machine Learning to Optimize Storage Systems Dr. Kiran Gunnam 1 Outline 1. Overview 2. Building Flash Models using Logistic Regression. 3. Storage Object classification 4. Storage Allocation recommendation

More information

Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset.

Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset. Glossary of data mining terms: Accuracy Accuracy is an important factor in assessing the success of data mining. When applied to data, accuracy refers to the rate of correct values in the data. When applied

More information

CSE Data Mining Concepts and Techniques STATISTICAL METHODS (REGRESSION) Professor- Anita Wasilewska. Team 13

CSE Data Mining Concepts and Techniques STATISTICAL METHODS (REGRESSION) Professor- Anita Wasilewska. Team 13 CSE 634 - Data Mining Concepts and Techniques STATISTICAL METHODS Professor- Anita Wasilewska (REGRESSION) Team 13 Contents Linear Regression Logistic Regression Bias and Variance in Regression Model Fit

More information

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review Gautam Kunapuli Machine Learning Data is identically and independently distributed Goal is to learn a function that maps to Data is generated using an unknown function Learn a hypothesis that minimizes

More information

Heteroskedasticity and Homoskedasticity, and Homoskedasticity-Only Standard Errors

Heteroskedasticity and Homoskedasticity, and Homoskedasticity-Only Standard Errors Heteroskedasticity and Homoskedasticity, and Homoskedasticity-Only Standard Errors (Section 5.4) What? Consequences of homoskedasticity Implication for computing standard errors What do these two terms

More information

Machine Learning Duncan Anderson Managing Director, Willis Towers Watson

Machine Learning Duncan Anderson Managing Director, Willis Towers Watson Machine Learning Duncan Anderson Managing Director, Willis Towers Watson 21 March 2018 GIRO 2016, Dublin - Response to machine learning Don t panic! We re doomed! 2 This is not all new Actuaries adopt

More information

Comparison of various classification models for making financial decisions

Comparison of various classification models for making financial decisions Comparison of various classification models for making financial decisions Vaibhav Mohan Computer Science Department Johns Hopkins University Baltimore, MD 21218, USA vmohan3@jhu.edu Abstract Banks are

More information

Data Mining. Chapter 1: Introduction. Adapted from materials by Jiawei Han, Micheline Kamber, and Jian Pei

Data Mining. Chapter 1: Introduction. Adapted from materials by Jiawei Han, Micheline Kamber, and Jian Pei Data Mining Chapter 1: Introduction Adapted from materials by Jiawei Han, Micheline Kamber, and Jian Pei 1 Any Question? Just Ask 3 Chapter 1. Introduction Why Data Mining? What Is Data Mining? A Multi-Dimensional

More information

Summary. Machine Learning: Introduction. Marcin Sydow

Summary. Machine Learning: Introduction. Marcin Sydow Outline of this Lecture Data Motivation for Data Mining and Learning Idea of Learning Decision Table: Cases and Attributes Supervised and Unsupervised Learning Classication and Regression Examples Data:

More information

Missing Data. Where did it go?

Missing Data. Where did it go? Missing Data Where did it go? 1 Learning Objectives High-level discussion of some techniques Identify type of missingness Single vs Multiple Imputation My favourite technique 2 Problem Uh data are missing

More information

Discriminate Analysis

Discriminate Analysis Discriminate Analysis Outline Introduction Linear Discriminant Analysis Examples 1 Introduction What is Discriminant Analysis? Statistical technique to classify objects into mutually exclusive and exhaustive

More information

INTRODUCTION TO BIG DATA, DATA MINING, AND MACHINE LEARNING

INTRODUCTION TO BIG DATA, DATA MINING, AND MACHINE LEARNING CS 7265 BIG DATA ANALYTICS INTRODUCTION TO BIG DATA, DATA MINING, AND MACHINE LEARNING * Some contents are adapted from Dr. Hung Huang and Dr. Chengkai Li at UT Arlington Mingon Kang, PhD Computer Science,

More information

k-nearest Neighbor (knn) Sept Youn-Hee Han

k-nearest Neighbor (knn) Sept Youn-Hee Han k-nearest Neighbor (knn) Sept. 2015 Youn-Hee Han http://link.koreatech.ac.kr ²Eager Learners Eager vs. Lazy Learning when given a set of training data, it will construct a generalization model before receiving

More information

Topic 7 Machine learning

Topic 7 Machine learning CSE 103: Probability and statistics Winter 2010 Topic 7 Machine learning 7.1 Nearest neighbor classification 7.1.1 Digit recognition Countless pieces of mail pass through the postal service daily. A key

More information

Predictive Analytics: Demystifying Current and Emerging Methodologies. Tom Kolde, FCAS, MAAA Linda Brobeck, FCAS, MAAA

Predictive Analytics: Demystifying Current and Emerging Methodologies. Tom Kolde, FCAS, MAAA Linda Brobeck, FCAS, MAAA Predictive Analytics: Demystifying Current and Emerging Methodologies Tom Kolde, FCAS, MAAA Linda Brobeck, FCAS, MAAA May 18, 2017 About the Presenters Tom Kolde, FCAS, MAAA Consulting Actuary Chicago,

More information

CS 188: Artificial Intelligence Fall Machine Learning

CS 188: Artificial Intelligence Fall Machine Learning CS 188: Artificial Intelligence Fall 2007 Lecture 23: Naïve Bayes 11/15/2007 Dan Klein UC Berkeley Machine Learning Up till now: how to reason or make decisions using a model Machine learning: how to select

More information

Classifiers and Detection. D.A. Forsyth

Classifiers and Detection. D.A. Forsyth Classifiers and Detection D.A. Forsyth Classifiers Take a measurement x, predict a bit (yes/no; 1/-1; 1/0; etc) Detection with a classifier Search all windows at relevant scales Prepare features Classify

More information

Multi-label classification using rule-based classifier systems

Multi-label classification using rule-based classifier systems Multi-label classification using rule-based classifier systems Shabnam Nazmi (PhD candidate) Department of electrical and computer engineering North Carolina A&T state university Advisor: Dr. A. Homaifar

More information

Interpretable Machine Learning with Applications to Banking

Interpretable Machine Learning with Applications to Banking Interpretable Machine Learning with Applications to Banking Linwei Hu Advanced Technologies for Modeling, Corporate Model Risk Wells Fargo October 26, 2018 2018 Wells Fargo Bank, N.A. All rights reserved.

More information

Insights JiWire Mobile Audience Insights Report Q4 2012

Insights JiWire Mobile Audience Insights Report Q4 2012 Table of Contents Mobile Audience Trends 2-6 Connected Device Adoption & Trends 7-10 Worldwide Location Highlights 11-12 Public Wi-Fi Trends 13 79.5 % of mobile consumers are influenced by the availability

More information

The Smartphone Consumer June 2012

The Smartphone Consumer June 2012 The Smartphone Consumer 2012 June 2012 Methodology In January/February 2012, Edison Research and Arbitron conducted a national telephone survey offered in both English and Spanish language (landline and

More information

Public Sensing Using Your Mobile Phone for Crowd Sourcing

Public Sensing Using Your Mobile Phone for Crowd Sourcing Institute of Parallel and Distributed Systems () Universitätsstraße 38 D-70569 Stuttgart Public Sensing Using Your Mobile Phone for Crowd Sourcing 55th Photogrammetric Week September 10, 2015 Stuttgart,

More information

NinthDecimal Mobile Audience Q Insights Report

NinthDecimal Mobile Audience Q Insights Report Q2 2012 Insights Report Table of Contents Connected Device Trends 2 Location-Based Behaviors 3-4 52% of on-the-go moms own a tablet 52 % Social Sharing Behaviors 5-7 Connected Device Adoption 8-9 Worldwide

More information

Insights JiWire Mobile Audience Insights Report Q2 2012

Insights JiWire Mobile Audience Insights Report Q2 2012 JiWire Mobile Audience Report JiWire Mobile Audience Report Table of Contents Connected Device Trends 2 Location-Based Behaviors 3-4 Social Sharing Behaviors 5-7 Connected Device Adoption 8-9 Worldwide

More information

How App Ratings and Reviews Impact Rank on Google Play and the App Store

How App Ratings and Reviews Impact Rank on Google Play and the App Store APP STORE OPTIMIZATION MASTERCLASS How App Ratings and Reviews Impact Rank on Google Play and the App Store BIG APPS GET BIG RATINGS 13,927 AVERAGE NUMBER OF RATINGS FOR TOP-RATED IOS APPS 196,833 AVERAGE

More information

Imputation for missing data through artificial intelligence 1

Imputation for missing data through artificial intelligence 1 Ninth IFC Conference on Are post-crisis statistical initiatives completed? Basel, 30-31 August 2018 Imputation for missing data through artificial intelligence 1 Byeungchun Kwon, Bank for International

More information

ITU s work on ICT measurement

ITU s work on ICT measurement ITU s work on ICT measurement WTO Conference on the Use of Data in the Digital Economy 2-3 October 2017 Martin Schaaper Senior ICT Analyst ICT Data and Statistics Division Telecommunication Development

More information

Welfare Navigation Using Genetic Algorithm

Welfare Navigation Using Genetic Algorithm Welfare Navigation Using Genetic Algorithm David Erukhimovich and Yoel Zeldes Hebrew University of Jerusalem AI course final project Abstract Using standard navigation algorithms and applications (such

More information

Slide Copyright 2005 Pearson Education, Inc. SEVENTH EDITION and EXPANDED SEVENTH EDITION. Chapter 13. Statistics Sampling Techniques

Slide Copyright 2005 Pearson Education, Inc. SEVENTH EDITION and EXPANDED SEVENTH EDITION. Chapter 13. Statistics Sampling Techniques SEVENTH EDITION and EXPANDED SEVENTH EDITION Slide - Chapter Statistics. Sampling Techniques Statistics Statistics is the art and science of gathering, analyzing, and making inferences from numerical information

More information

Industrialising Small Area Estimation at the Australian Bureau of Statistics

Industrialising Small Area Estimation at the Australian Bureau of Statistics Industrialising Small Area Estimation at the Australian Bureau of Statistics Peter Radisich Australian Bureau of Statistics Workshop on Methods in Official Statistics - March 14 2014 Outline Background

More information

IMDB Film Prediction with Cross-validation Technique

IMDB Film Prediction with Cross-validation Technique IMDB Film Prediction with Cross-validation Technique Shivansh Jagga 1, Akhil Ranjan 2, Prof. Siva Shanmugan G 3 1, 2, 3 Department of Computer Science and Technology 1, 2, 3 Vellore Institute Of Technology,

More information

ESPN Analysis. Webpage, Content, Twitter. Produced by: Ashton Keys & Stephen Nisbet

ESPN Analysis. Webpage, Content, Twitter. Produced by: Ashton Keys & Stephen Nisbet ESPN Analysis Webpage, Content, Twitter Produced by: Ashton Keys & Stephen Nisbet Goals As avid sports fans we both are constant user of platforms from our personal experience and experiences expressed

More information

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 3. Chapter 3: Data Preprocessing. Major Tasks in Data Preprocessing

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 3. Chapter 3: Data Preprocessing. Major Tasks in Data Preprocessing Data Mining: Concepts and Techniques (3 rd ed.) Chapter 3 1 Chapter 3: Data Preprocessing Data Preprocessing: An Overview Data Quality Major Tasks in Data Preprocessing Data Cleaning Data Integration Data

More information

Intro to Artificial Intelligence

Intro to Artificial Intelligence Intro to Artificial Intelligence Ahmed Sallam { Lecture 5: Machine Learning ://. } ://.. 2 Review Probabilistic inference Enumeration Approximate inference 3 Today What is machine learning? Supervised

More information

Traffic Signs Recognition using HP and HOG Descriptors Combined to MLP and SVM Classifiers

Traffic Signs Recognition using HP and HOG Descriptors Combined to MLP and SVM Classifiers Traffic Signs Recognition using HP and HOG Descriptors Combined to MLP and SVM Classifiers A. Salhi, B. Minaoui, M. Fakir, H. Chakib, H. Grimech Faculty of science and Technology Sultan Moulay Slimane

More information
