STRATIFIED SAMPLING MEETS MACHINE LEARNING
KEVIN LANG, KONSTANTIN SHMAKOV, EDO LIBERTY
4 [Figure: Apps with Flurry SDK]
6 A query is a function $q(\cdot)$ evaluated on every record $u_i$; its answer is $\sum_i q(u_i)$.
Examples:
Number of events of a certain type
Number of unique users
Number of unique users on a specific day
Total time spent in a certain geo
Average dollars spent, by age
7 SAMPLING
Challenges:
1. The data is very large. Computing $\sum_i q(u_i)$ exactly is too costly.
2. The function $q(\cdot)$ is user specified and completely unconstrained.
Good news: An approximate answer is acceptable (if the error is small).
Solution: Estimate the answer on a random subset of the records.
8 NOTATIONS
$q_i := q(u_i)$ for brevity
$y := \sum_i q_i$, the exact answer for the query $q$
$p_i$, the probability of choosing record $i$
$S$, the set of sampled records, each chosen with probability $p_i$
$\tilde{y} = \sum_{i \in S} q_i / p_i$, the Horvitz-Thompson estimator for $y$
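The notation above can be exercised with a short simulation. This is a minimal sketch, not code from the slides: the function name `horvitz_thompson`, the synthetic `q_values`, and the uniform rate of 0.1 are all illustrative assumptions, and records are assumed to be sampled independently of one another.

```python
import random

def horvitz_thompson(q_values, probs, rng):
    """Include each record i independently with probability probs[i] and
    return the Horvitz-Thompson estimate: sum over sampled i of q_i / p_i."""
    estimate = 0.0
    for q_i, p_i in zip(q_values, probs):
        if rng.random() < p_i:      # record i lands in the sample S
            estimate += q_i / p_i   # reweight by the inverse sampling probability
    return estimate

rng = random.Random(0)
q_values = [1.0] * 900 + [50.0] * 100   # hypothetical per-record values q_i
probs = [0.1] * len(q_values)           # uniform sampling (an assumption)
exact = sum(q_values)                   # y, the exact answer

# Averaging many independent estimates should land close to y.
trials = [horvitz_thompson(q_values, probs, rng) for _ in range(2000)]
mean_estimate = sum(trials) / len(trials)
print(exact, mean_estimate)
```

A single estimate can be far from y; only the average over repeated samples is expected to match it, which is what unbiasedness means here.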
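9 PROPERTIES
$\mathbb{E}[\tilde{y} - y] = 0$: the Horvitz-Thompson estimator is unbiased.
$\sigma[\tilde{y} - y] \le y \sqrt{1 / (\zeta \cdot \mathrm{card}(q))}$: its standard deviation isn't large.
Here $\zeta = \min_i p_i$ and $\mathrm{card}(q) := \sum_i |q_i| / \max_i |q_i|$.
$\Pr[|\tilde{y} - y| \ge \varepsilon y] \le e^{-O(\varepsilon^2 \zeta \, \mathrm{card}(q))}$: the probability of a large error is small.
If $\mathrm{card}(q)$ is on the order of $n$, a sample of size $|S|$ on the order of $1/\varepsilon^2$ suffices.
(Olken, Rotem and Hellerstein 1986, and 1990): application to databases.
(Acharya, Gibbons, Poosala 2000): uniform sampling is best in the worst case.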
10 STRATIFIED SAMPLING
Sample = 100,000 US individuals.
Query = Republicans vs. Democrats in American Samoa?
American Samoa is 0.02% of the population, so only ~20 people from American Samoa are in the sample.
Survey error is very large! If $\mathrm{card}(q)$ is small, $|S|$ must be large.
Sample different strata (e.g. US territories) with different probabilities. (Neyman, Jerzy 1934)
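The arithmetic behind this example can be checked directly. The sketch below is illustrative only: the 50/50 worst-case split and the oversampled stratum size of 2,000 are assumptions, not figures from the slides.

```python
import math

sample_size = 100_000
samoa_share = 0.0002   # 0.02% of the population, as stated on the slide

# Expected number of respondents from American Samoa under uniform sampling.
expected_samoa = sample_size * samoa_share

def worst_case_std_error(n):
    """Standard error of a proportion estimated from n respondents,
    at the worst case of a 50/50 split: sqrt(0.25 / n)."""
    return math.sqrt(0.25 / n)

uniform_error = worst_case_std_error(expected_samoa)

# Hypothetical stratified design: sample the stratum at a higher rate so
# that it contributes 2,000 respondents instead of about 20.
stratified_error = worst_case_std_error(2_000)

print(expected_samoa, uniform_error, stratified_error)
```

With about 20 respondents the standard error is roughly 11 percentage points, too coarse to separate two parties; raising the stratum's sampling probability is what brings it down.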
11 DBLP EXAMPLE
Choosing the right strata is hard!
2,101,151 papers
1000 most populous venues
Query examples:
title contains "learning" and # authors <= 3
title contains "mechanism" and year > 2004
What is the right stratification here?
Stratifying by venue made things worse!
Stratifying by year was better but still worse than uniform sampling.
12 SAMPLING, STRATIFICATION, AND DATABASES
Design strata that minimize worst case variance on possible queries
Linearly combine strata based on record features
Combine stratified and uniform sampling: Congressional Sampling
o Acharya, Gibbons, Poosala 2000: Important idea: consider past queries to the database! Each stratum is a set of records that agree on all queries
o Chaudhuri, Das and Narasayya 2007: optimize for the query log. Split into two strata per query. Take linear combinations
o Joshi, Jermaine, 2008: linear combinations of stratified probabilities
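13 OUR APPROACH
Assume queries are drawn from a distribution $\mathcal{Q}$.
Use the query log $Q$ as a training set (assumed to be drawn from $\mathcal{Q}$).
Allow each record to be sampled with a different probability $p_i$.
Minimize the risk $\mathbb{E}[(\tilde{y} - y)^2]$. This translates to
$$\mathbb{E}_{q \sim \mathcal{Q}} \sum_i q_i^2 (1/p_i - 1),$$
where the distribution $\mathcal{Q}$ is unknown.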
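The step from the squared error to the sum over records can be spelled out as follows. This is a sketch under one assumption that the slide does not state explicitly: each record $i$ enters the sample independently, with indicator $X_i \sim \mathrm{Bernoulli}(p_i)$ (the symbol $X_i$ is introduced here for illustration).

```latex
\begin{align*}
\tilde{y} &= \sum_i \frac{q_i}{p_i} X_i,
  \qquad \mathbb{E}[X_i] = p_i
  \;\Rightarrow\; \mathbb{E}[\tilde{y}] = \sum_i q_i = y, \\
\mathbb{E}\big[(\tilde{y} - y)^2\big]
  &= \operatorname{Var}(\tilde{y})
   = \sum_i \frac{q_i^2}{p_i^2} \operatorname{Var}(X_i)
   = \sum_i \frac{q_i^2}{p_i^2}\, p_i (1 - p_i)
   = \sum_i q_i^2 \left( \frac{1}{p_i} - 1 \right).
\end{align*}
```

Taking the expectation over queries $q$ then gives the risk expression. Independence across records is what removes the cross terms in the variance.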
14 OUR APPROACH
ERM: minimize over the query log $Q$
$$\sum_{q \in Q} \sum_i q_i^2 (1/p_i - 1)$$
subject to
Sampling budget: $\sum_i p_i c_i \le B$ (where $\sum_i c_i \ge B$)
Regularization: $\forall i,\; p_i \in [\zeta, 1]$ (where $\zeta \le B / \sum_i c_i$)
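15 OUR APPROACH
Solve with Lagrange multipliers:
$$\max_{\alpha, \beta, \gamma} \left[ \frac{1}{|Q|} \sum_{q \in Q} \sum_i q_i^2 (1/p_i - 1) - \sum_i \alpha_i (p_i - \zeta) - \sum_i \beta_i (1 - p_i) - \gamma \Big( B - \sum_i p_i c_i \Big) \right]$$
By the KKT conditions:
$$p_i \propto \sqrt{\frac{1}{c_i} \cdot \frac{1}{|Q|} \sum_{q \in Q} q_i^2} \quad \text{or} \quad p_i = \zeta \quad \text{or} \quad p_i = 1$$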
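One way to turn the KKT form into numbers is sketched below. This is not the authors' implementation: the bisection on the budget multiplier, the function name `erm_probabilities`, and the toy inputs are all assumptions made for illustration. Here `costs` are read as per-record sampling costs c_i (assumed strictly positive), `budget` as B, and `zeta` as the lower bound imposed on every p_i.

```python
import math

def erm_probabilities(query_log, costs, budget, zeta):
    """Sketch: p_i = clip(sqrt(w_i / (gamma * c_i)), zeta, 1), where
    w_i is the sum of q_i^2 over logged queries, and the multiplier gamma
    is found by bisection so that sum_i p_i * c_i fits the budget."""
    if sum(zeta * c for c in costs) > budget:
        raise ValueError("infeasible: the lower bound alone exceeds the budget")
    n = len(costs)
    weights = [sum(q[i] ** 2 for q in query_log) for i in range(n)]

    def probs(gamma):
        return [min(1.0, max(zeta, math.sqrt(w / (gamma * c))))
                for w, c in zip(weights, costs)]

    def spend(gamma):
        return sum(p * c for p, c in zip(probs(gamma), costs))

    lo, hi = 1e-12, 1.0
    if spend(lo) <= budget:       # budget is not binding
        return probs(lo)
    while spend(hi) > budget:     # grow hi until the budget is met
        hi *= 2.0
    for _ in range(200):          # bisection keeps spend(hi) <= budget
        mid = (lo + hi) / 2.0
        if spend(mid) > budget:
            lo = mid
        else:
            hi = mid
    return probs(hi)

# Toy example: two logged queries over four records with unit costs.
query_log = [[3.0, 1.0, 0.0, 1.0], [4.0, 1.0, 0.0, 1.0]]
p = erm_probabilities(query_log, costs=[1.0] * 4, budget=2.0, zeta=0.1)
print(p)
```

In the toy example, the record that dominates the logged queries is sampled with probability 1, the record no query touches falls to the lower bound `zeta`, and the remaining budget is split between the other two. The 1/|Q| factor is a constant and is absorbed into the multiplier.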
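16 OUR APPROACH
$$\mathrm{Risk}(p) \le \mathrm{Risk}(p^*) \left( 1 + O\!\left( \mathrm{skew} \sqrt{\frac{\log(n/\delta)}{|Q|}} \right) \right)$$
Here $\mathrm{Risk}(p)$ is the risk of the algorithm's solution, $\mathrm{Risk}(p^*)$ is the best achievable risk, skew measures the "badness" of the database, and $|Q|$ is the training set size.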
17 RESULTS
18 RESULTS [Figure: DBLP dataset. Expected error vs. value of regularization parameter Eta (weaker to stronger). Curves: uniform sampling and four training-query-set sizes.]
19 RESULTS [Figure: Cube dataset. Expected error vs. value of regularization parameter Eta (weaker to stronger). Curves: uniform sampling with p = 1/10, and 50, 100, 200, 800, and 6400 training queries.]
20 RESULTS [Figure: YAM+ dataset. Expected error vs. value of regularization parameter Eta (weaker to stronger). Curves: uniform sampling and increasing numbers of training queries, including 100, 200, 400, and all training queries.]
21 RESULTS [Figure: YAM+ dataset. Expected error vs. numeric cardinality of test query. Curves: regularized ERM and uniform sampling.]
22 RESULTS [Figure: YAM+ dataset. Probability (rescaled) vs. average error. Curves: uniform sampling and regularized ERM.]
23 RESULTS [Figure: YAM+ dataset. Expected error vs. sampling rate = budget / (total cost). Curves: uniform sampling and regularized ERM.]
24 RESULTS [Figure: Cube dataset. Expected error vs. numeric cardinality of test query. Curves: ERM and uniform sampling.]