Hypothesis Testing: An SQL Analogy


Leroy Bracken, Boulder Creek, CA
Paul D Sherman, San Jose, CA

ABSTRACT

This paper is all about missing data. Do you ever know something about someone but not know who that someone is? Are you missing information you didn't know you don't have? Do outer joins leave you in outer space? Then read this article and come back to Earth. The leading or "pivot" table of an SQL statement is like the Null Hypothesis H0 made during a statistical analysis. It is the starting point of a query which asks questions about data. We answer these questions, present and compare situations of missing data that are analogous to making Type-I and Type-II errors, and show with some examples using real data how to estimate the likely band of error on an educated formal guess. Presented in this article is a simple solution for handling the pivot table of any SQL query so that no data is left behind.

Skill Level: Intermediate Base SAS, familiarity with PROC SQL.

INTRODUCTION

A recent tutorial article on hypothesis testing presents the following description (Carpenter, 2006). It notes that the probability β of making a Type-II error is a fairly difficult calculation. This difficulty is precisely what we examine using an SQL and data-model analogy.

                                         Actual status of the Null Hypothesis
   Outcome of the test                   H0 is true                 H0 is false
   -----------------------------------   ------------------------   ------------------------
   Test is significant                   Commit a Type-I error.     Correct decision
   (reject H0)                           The probability of making
                                         this mistake is α.
   Test is not significant               Correct decision           Commit a Type-II error.
   (fail to reject H0)                                              The probability of making
                                                                    this mistake is β.

   Table 1. Outcomes of a hypothesis test (after Carpenter, 2006)

ESTIMATING ERROR FROM AN EDUCATED FORMAL GUESS

Some important points to understand are the following:

- One often makes a Type-II error by having too small a sample size and too much missing data.
- What is missing is usually what is most interesting or important. It is usually those points in the extrema or "tail" of the distribution, where a few points have high relative importance.
- Even with only 2-3% of the data missing you will often get incorrect results.
- Monte Carlo and backfill techniques exist to "probe" the sensitivity of variables to missing information (a minimal sketch appears at the end of this section).

Using a fictitious sample dataset we compare the SQL analysis results with various sample sizes, that is, with and without (or with varying degrees of) missing data. The two types of missing data we consider are the labels of the data and the data values themselves. While the problem of missing data values is most common and leads to the Type-I error, we will see that when the labels of the data are missing we don't even know that there are data values to be considered. Below are some examples of these two types of data in various scenarios.

   Scenario        Labels of data                  Data values themselves
                   (when missing, these relate     (when missing, these relate
                   to Type-II errors)              to Type-I errors)
   -------------   -----------------------------   -----------------------------
   Shoe store      Style & size                    Number of shoes
   Web analytics   Site address                    Number of clicks
   Clinical        Patient ID                      Demographics, lab results, etc.

We will use the clinical scenario in all discussions to follow.
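To make the Monte Carlo "probe" idea concrete, here is a minimal sketch (not taken from the paper) that simulates a long-tailed lab value, hides roughly 3% of the observations in the upper tail, and compares the full-data mean with the mean of what remains. The dataset and variable names (SIM, VAL1, HIDDEN) are illustrative assumptions of ours.

   /* Minimal Monte Carlo sketch (illustrative, not from the paper):       */
   /* simulate a long-tailed value, hide ~3% of the upper-tail rows, and   */
   /* see how far the "observed" mean drifts from the full-data mean.      */
   data sim;
      call streaminit(2006);
      do ptid = 1 to 1000;
         val1 = 90 + 5 * rand('exponential');                /* long right tail */
         hidden = (val1 > 100) and (rand('uniform') < 0.25); /* roughly 3% of rows */
         output;
      end;
   run;

   proc sql;
      select mean(val1)                                      as full_mean,
             mean(case when hidden = 1 then . else val1 end) as observed_mean
      from sim;
   quit;

Even this small, biased loss of tail points visibly shifts the estimate, which is the sensitivity the "probe" is meant to expose.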

THE SQL ANALOGY

The SQL analogy to hypothesis testing is shown in Table 2 below.

                                         Actual content of the Pivot Table
   Content of the detail data tables     Pivot is complete          Pivot is not complete
   -----------------------------------   ------------------------   ------------------------
   Missing some detail data              PATS has more patients     All data for which we
   for some patients                     than we have DEMOG or      have patients
                                         LABDAT info
   Have all detail data                  No patients for which      PATS is missing a patient
   for all patients                      we don't have data         for which we have DEMOG
                                                                    or LABDAT info

   Table 2. The SQL analogy to the hypothesis-test outcomes of Table 1

In the data-modeling analogy, the Null Hypothesis is our leading or Pivot Table, and the tests on the hypothesis are the more "concrete" detail tables applied (or joined in) to the pivot:

   SELECT ...
   FROM pivot           <- the hypothesis
     JOIN detail ...    <- the tests of the hypothesis

A typical statement of the Null Hypothesis goes like this:

   H0: The Pivot Table contains a record matching each and every detail data table.

Simply put, a Type-I error is when we know everyone but not everything about them yet. The more subtle, and more difficult to find, Type-II error happens when we know something about someone but don't know who that someone is.

CLINICAL EXAMPLE

Our clinical trial data model has three tables: PATS, DEMOG and LABDAT. Suppose there are five patients, but we are missing some information for some of them, as shown below.

   PATS         DEMOG                LABDAT
   PTID         PTID  GENDER         PTID  LABDT       VAL1  VAL2
   ----         ----  ------         ----  ----------  ----  ----
      1            1  M                 1  1465516800  89.2   2.6
      2            3  M                 2  1465689600  91.7   1.5
      3            4  F                 5  1466812800  93.5   3.7

Notice that DEMOG is missing patients 2 and 5, while LABDAT has no information for patients 3 and 4. Further, and most suspicious, PATS does not identify patients 4 and 5.

When dealing with potentially missing data it is important to use outer joins in the SQL query logic. The outer join guarantees that we don't eliminate observations when relating tables where one or the other table may fail the join relationship. The sequence of tables joined is very important for the proper and efficient functioning of any query (Sherman, 2002). Usually, the first table specified is the most abstract and is called the "pivot"; its contents are fetched entirely and completely, not depending upon any other table. Trouble may occur when the other, more concrete, "supporting detail" tables contain more information than the pivot. This is analogous to violating a Null Hypothesis: our assumption that the pivot contains one record for each and every possible record of the concrete tables is clearly false.
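As a quick illustration of why the outer join matters here, the sketch below (our own, not from the paper) contrasts an inner join with a left outer join of PATS and DEMOG; it assumes the tables are loaded as in the EXAMPLE PROGRAM CODE section at the end of this paper.

   /* Inner vs. outer join on the example tables (illustrative sketch; the */
   /* PATS and DEMOG tables are created in the EXAMPLE PROGRAM CODE).      */
   proc sql;
      /* Inner join: patient 2 (no DEMOG record) silently disappears */
      select p.ptid, d.gender
      from pats p, demog d
      where p.ptid = d.ptid;

      /* Left outer join: every row of the pivot (PATS) survives, with a  */
      /* missing GENDER flagging the Type-I problem for patient 2         */
      select p.ptid, d.gender
      from pats p
           left join demog d on d.ptid = p.ptid;
   quit;

The inner join returns only patients 1 and 3; the outer join keeps all three PATS rows and exposes the gap instead of hiding it.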

EXAMPLE 1: TYPE-I ERROR - MISSING SOME DETAIL INFO

Consider the following SQL statement, shown in full in the EXAMPLE PROGRAM CODE section. Our choice for the Pivot Table is the PATS table, and we are careful to use outer joins because some patient data might not yet be available:

   ...
   FROM pats                  /* independent pivot */
   ...

The output from this query is shown below. Is it correct? NO! Patients 4 and 5 are completely unrepresented. The missing values indicate only the easier-to-find Type-I errors.

   PTID  GENDER  LABDT
      1  M       1465516800
      2          1465689600   <- missing GENDER: Type-I error
      3  M       .            <- missing LABDT:  Type-I error

EXAMPLE 2: TYPE-II ERROR - MISSING SOME PIVOT INFO

In order to be more precise, we must first pre-assemble the Pivot Table to collect all possible patients from all sources of information. Doing so guarantees that we miss no available data. Because it is an unequivocal selection from all tables, we simply UNION, or row-wise join, everything together. Using UNION ALL helps improve query efficiency by eliminating the needless sort and de-duplication step; we perform that later with an outer-most GROUP BY. PROC SQL requires us to add an innocuous summary function such as COUNT(*) to prevent the SQL optimizer from tearing apart the GROUP BY.

   FROM ( ... UNION ALL ...          /* consolidated Pivot Table        */
          GROUP BY ptid ) AS pats    /* make sure the pivot is unique   */

Using this consolidated form of PATS in the main SQL statement, we correctly identify all five patients as shown below.

   PTID  GENDER  LABDT
      1  M       1465516800
      2          1465689600
      3  M       .
      4  F       .            <- previously hidden: Type-II error
      5          1466812800   <- previously hidden: Type-II error

With a little extra work we can correctly report both types of errors arising from our query. Patient 4 is missing her lab data, while patient 5 lacks demographic information; neither of them is listed in the patients record table.

SHOW US ONLY THE PROBLEMS...

We might wish to identify only the patients with errors of either type for some special problem report. Filter the consolidated Pivot Table as follows, using a post-aggregation HAVING clause. We know that there must be one record from each data table source for complete information; any fewer indicates trouble.

   HAVING count(*) < 3        /* 3 = number of tables being consolidated */
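A possible rendering of that problem report is sketched below. It assumes the three tables from the clinical example; the column name NSRC for the innocuous COUNT(*) is our own.

   /* Sketch of the "show us only the problems" report.  A patient         */
   /* contributing fewer than 3 source rows is missing something somewhere.*/
   proc sql;
      create table problems as
      select a.ptid, d.gender, l.labdt
      from (select ptid, count(*) as nsrc
            from (select ptid from pats
                  union all
                  select ptid from demog
                  union all
                  select ptid from labdat)
            group by ptid
            having count(*) < 3)  as a       /* consolidated pivot, kept   */
                                             /* only when a source is lost */
           left join demog  d on d.ptid = a.ptid
           left join labdat l on l.ptid = a.ptid;
   quit;

For the example data this lists patients 2, 3, 4 and 5; only patient 1 has a record in all three tables.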

In the foregoing we have tacitly assumed that all tables in our data model are "flat" in a fact/dimension or data-warehousing sense. This greatly simplifies the analogy because there is only one observation in each table per unique value of patient ID. Had this not been the case, such as when LABDAT is in vertical or normalized form, one would in practice use a GROUP BY on each of the UNIONed members of the consolidated Pivot Table. We don't believe this minor complication adversely affects the generality of the hypothesis-test SQL analogy.

CONCLUSION

The probability of not rejecting the null hypothesis when it is false is proportional to the amount of missing data in the tail of a distribution. The difficulty in calculating, or even estimating, this probability stems from the fact that, without the null hypothesis as a basis, we have no reference point with which to compare or test the individual data points. Realizing that the null hypothesis provides an important reference point, or basis, is the key to the analogy with an SQL statement. In SQL, the first table listed (and thus fetched in its entirety) is the pivot table. All other sources of data are "joined in" to this basis. Therefore, without the basis there is nothing with which the detail tables can join. The entire difficulty of calculating β compares directly to properly constructing this first, leading, or "pivoting" table in the FROM clause.

The asymmetry of the FROM clause in an SQL statement becomes especially important when using outer joins. The first table listed in the FROM clause must contain an aggregate of all types of data, both the data values and the labels describing them. Starting with the basic statistics of making an educated formal guess, we have presented a simple prescription which avoids making the Type-II error: the first, or "pivot," table of an SQL query using outer joins should always be an aggregated sub-query (one with a GROUP BY clause) which unequivocally ties together the sources of all tables being outer-joined.

REFERENCES

Carpenter, Art. "Simple Tests of Hypotheses for the Non-statistician: What They Are and Why They Can Go Bad." Tutorials section, Annual Pharmaceutical Industry SAS Users Group Conference, May 21-24, 2006, Bonita Springs, FL. Paper TU06.

Sherman, Paul D. "Creating Efficient SQL: Four Steps to a Quick Query." Proceedings of the Twenty-Seventh Annual SAS Users Group International Conference, April 14-17, 2002, Orlando, FL. Paper 073-27.

TRADEMARK CITATION

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are registered trademarks or trademarks of their respective companies.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the authors at:

   Leroy Bracken
   315 Teilh Drive
   Boulder Creek, CA 95006
   (831) 401-1704
   bracken95006@comcast.net

   Paul D Sherman
   335 Elan Village Lane, Apt. 424
   San Jose, CA 95134
   (408) 383-0471
   sherman@idiom.com
   Web site: http://www.idiom.com/~sherman/paul/pubs/hyptest

EXAMPLE PROGRAM CODE

proc sql noprint;
   create table pats
      ( ptid smallint primary key );

   create table demog
      ( ptid   smallint not null primary key,
        gender char(1)  not null );

   create table labdat
      ( ptid  smallint not null primary key,
        labdt integer  not null,
        val1  real,
        val2  real );

   insert into pats (ptid)
      values (1)
      values (2)
      values (3);

   insert into demog (ptid, gender)
      values (1, 'M')
      values (3, 'M')
      values (4, 'F');

   insert into labdat (ptid, labdt, val1, val2)
      values (1, '10JUN2006:00:00:00'dt, 89.2, 2.6)
      values (2, '12JUN2006:00:00:00'dt, 91.7, 1.5)
      values (5, '25JUN2006:00:00:00'dt, 93.5, 3.7);
quit;

proc sql noprint;
   /* Example 1: PATS alone as the (independent) pivot */
   create table bad as
      select p.ptid, d.gender, l.labdt
      from pats p
           left join demog  d on d.ptid = p.ptid
           left join labdat l on l.ptid = p.ptid;

   /* Example 2: consolidated pivot assembled from all three tables */
   create table good as
      select pats.ptid, d.gender, l.labdt
      from (select ptid, count(*) as nsrc   /* innocuous summary function  */
            from (select ptid from pats
                  union all
                  select ptid from demog
                  union all
                  select ptid from labdat)
            group by ptid)  as pats         /* consolidated, unique pivot  */
           left join demog  d on d.ptid = pats.ptid
           left join labdat l on l.ptid = pats.ptid;
quit;

proc print data=bad noobs;
   title 'Bad Output: Type-II errors not shown';
run;

proc print data=good noobs;
   title 'Good Output: All patients identified';
run;