Checking for Duplicates Wendi L. Wright

ABSTRACT

This introductory-level paper demonstrates a quick way to find duplicates in a dataset (with both simple and complex keys). It discusses what to do when they are found and shows the difference between the PROC SORT options NODUP and NODUPKEY. The last example shows how to use the NODUPKEY option to keep only the record with the highest value of a variable when duplicates are found.

INTRODUCTION

Have you ever run a program that merged two or more datasets using a BY statement and received the note "MERGE statement has more than one data set with repeats of BY values"? When you get this message, it generally means that your resulting dataset is not going to be what you intended. When you merge with a BY statement, you should make sure that all of the datasets, or all but one, have unique values of the BY variables. At most one dataset may have duplicate key values for the merge to work correctly.

FINDING DUPLICATES FOR SIMPLE (OR SINGLE VARIABLE) KEYS

I have a short piece of code that uses a combination of two PROCs (FREQ and PRINT) to find duplicates in my dataset's key values. See the example below.

    PROC FREQ;
      TABLES keyvar / noprint out=keylist;
    RUN;
    PROC PRINT data=keylist;
      WHERE count ge 2;
    RUN;

The above code creates a dataset (keylist) with one row for each value of the key variable, along with a count of how often that value occurs. PROC PRINT then scans that distribution and prints any values that have a count of 2 or more. This tells you not only that you have duplicates, but also which key values are repeated. Then you can investigate and remove the problem observations.

FINDING DUPLICATES FOR COMPLEX (MULTIPLE VARIABLE) KEYS

If you have a unique key that is composed of several variables, you can modify the above code this way:

    PROC FREQ;
      TABLES keyvar1*keyvar2*keyvar3*keyvar4 / noprint out=keylist;
    RUN;
    PROC PRINT data=keylist;
      WHERE count ge 2;
    RUN;
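As a self-contained sketch of the multiple-variable check, the program below builds a small hypothetical dataset and lists the key combinations that occur more than once. The dataset name visits, the variables site, patient, and value, and all of the rows are invented for illustration; only the PROC FREQ and PROC PRINT pattern comes from the paper.

```sas
/* Hypothetical data: site and patient together are meant to be a unique key. */
data visits;
  input site $ patient value;
  datalines;
A 1 10
A 2 11
B 1 12
A 2 13
;
run;

/* Count each site*patient combination without printing the crosstab. */
proc freq data=visits;
  tables site*patient / noprint out=keylist;
run;

/* List only the combinations that occur two or more times. */
proc print data=keylist;
  where count ge 2;
run;
```

With these rows, the only combination that should be listed is site A, patient 2.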

Although the multiple-variable TABLES request might work, you may also get the message:

    ERROR: Unable to allocate sufficient memory. At least 1168978K bytes were requested. You must either increase the amount of memory available, or approach the problem differently.

When this error occurs, you can 'trick' SAS into creating your PROC FREQ dataset by adding a new variable (let's call it keyvar) to your dataset using the concatenation operator (||). First set the new variable's length, which should be the sum of the lengths of all of your key variables (for a numeric variable, count the width of the format you convert it with). Then assign the value of keyvar by concatenating the key variables. If one of the variables is numeric, you can add it to the concatenated keyvar variable by using the PUT function, for example put(numeric_var, 3.).

    DATA ---;
      /* other statements */
      LENGTH keyvar $50;
      keyvar = keyvar1 || put(keyvar2,3.) || keyvar3 || keyvar4;
      /* other statements */

The new variable will function as a simple key, and you can run PROC FREQ on the new variable by itself.

WHAT TO DO WHEN YOU FIND DUPLICATES

First determine why there are duplicates. Is the variable supposed to be unique? At this point you may need to go back to the project originators to find out. After reviewing the duplicate observations you have found, the project originators may decide to change the key variable to something else entirely, or they may decide to add other variables (making it a complex key), or they may ask you to eliminate the duplicate records, either by random selection or by following a particular rule. What you do to fix the issue will depend on what they say.

We are going to assume they asked you to remove the duplicates. The examples below remove the duplicate records using PROC SORT.

DESCRIPTION OF SOME DATA WITH DUPLICATES

You use the NODUP option on PROC SORT to remove records in which every variable matches the corresponding variable in the previous observation. You must be careful about which variables are in the BY statement, because SAS only looks at the one previous observation when comparing variables and deciding which records to remove. For example, here is a sample dataset with some duplicates in it.
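The paper's table did not survive in this transcription. As a stand-in, the DATA step below sketches a dataset with the properties the text describes. The dataset name sample and the variables name, admin, and score are taken from the paper's code and prose; the individual rows and score values are invented for illustration and are not the author's data.

```sas
/* Hypothetical stand-in for the paper's sample table: rows and scores are invented. */
data sample;
  input name $ admin $ score;
  datalines;
Mary Mar06 85
Mary Mar06 90
Peter Jun06 78
Peter Mar06 82
Peter Jun06 78
John Jun06 88
John Jun06 88
;
run;

/* Display the rows so the different kinds of duplication can be inspected. */
proc print data=sample;
run;
```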

As you can see, there are various forms of duplication: some of the records above are duplicated across all variables, and some are duplicated only on the key variables (name or name/admin). Note that:

Mary has two different scores for the same admin (Mar06). If your key is name and admin, then you would have a problem and would need to find out what had caused this duplication.

Peter has two identical records for the same admin (Jun06), but they are not sequential in the dataset. You will see how this can cause problems if you ask SAS to remove the duplicate using the NODUP option.

John has two identical records, and his are in the dataset sequentially.

NODUP OPTION

The program below removes exact duplicates after sorting by only the Name variable. The resulting dataset is next to it. Note that John's extra record has been deleted, but not Peter's. Peter's two identical records are not next to one another in the sorted dataset, and SAS only looks one record back. To fix this, you could sort by both NAME and ADMIN, but the safest fix is to sort on all variables in the dataset (use caution here, as this can be a real resource hog).

    PROC SORT data=sample NODUP;
      BY name;
    RUN;

To sort by all the variables without having to list them all in the program, you can use the keyword _ALL_ in the BY statement (see below). You can see that now Peter's duplicate record has been deleted.

    PROC SORT data=sample NODUP;
      BY _all_;
    RUN;

NODUPKEY OPTION

An alternative to the NODUP option, the NODUPKEY option compares only the variables listed in the BY statement when it looks for matches. If the BY statement variables are the same, then any subsequent records after the first one encountered are removed. For example, using the same input dataset as for the NODUP option, we run the following code. Note that only one observation is kept for each person. You can also run this with both name and admin in the BY statement, in which case you would end up with only one observation for each name/admin combination.

    PROC SORT data=sample NODUPKEY;
      BY name;
    RUN;

SELECTING WHICH OBSERVATION SHOULD BE KEPT WHEN USING NODUPKEY

To specify which observation to keep when you use the NODUPKEY option, run a pre-sort before running the SORT with the NODUPKEY option. For example, you may be asked to keep the score for the last admin for each student, or to keep the highest score value. The example below shows how to keep the highest score for each student.

    PROC SORT data=sample;
      BY name descending score;
    RUN;

Now if we sort by name with the NODUPKEY option, only the highest score will be kept and we get the results we want. Notice that the score variable is no longer one of the BY variables. Since this dataset has already been sorted, SAS takes very little time to do this second sort. You can see that now the dataset holds only unique values in the name field.

    PROC SORT data=sample NODUPKEY;
      BY name;
    RUN;

This technique works for complex keys as well. Just include all the key variables in the BY statement. The entire set of key variables should appear in both sorts and should be placed before the descending score portion of the first sort.
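For a complex key, the two-sort pattern might look like the following sketch. The dataset name scores and its rows are hypothetical; name and admin stand in for the key variables, and score is the variable whose highest value should be kept.

```sas
/* Hypothetical data: name and admin form the key; score decides which row to keep. */
data scores;
  input name $ admin $ score;
  datalines;
Mary Mar06 85
Mary Mar06 90
Mary Jun06 70
John Jun06 88
;
run;

/* Pre-sort: all key variables first, then score from highest to lowest. */
proc sort data=scores;
  by name admin descending score;
run;

/* NODUPKEY keeps the first record per name/admin, which is now the highest score. */
proc sort data=scores nodupkey;
  by name admin;
run;
```

With these rows, the Mary/Mar06 record that survives should be the one with score 90. The pattern relies on PROC SORT preserving the relative order of observations within a BY group, which is its default (EQUALS) behavior.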

CONCLUSION

This was a brief introduction to resolving problems when merging datasets without unique values in your key fields. The solutions I presented for finding and fixing duplicates have come in very handy for me in many of the programs I have written. I hope that you find them handy as well.

AUTHOR CONTACT

Your comments and questions are valued and welcome. Contact the author at:

Wendi L. Wright
1351 Fishing Creek Valley Rd.
Harrisburg, PA 17112
Phone: (717) 513-0027
E-mail: wendi_wright@ctb.com

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.