Importing CSV Data to All Character Variables Arthur L. Carpenter California Occidental Consultants, Anchorage, AK

Similar documents
Reading and Writing RTF Documents as Data: Automatic Completion of CONSORT Flow Diagrams

KEYWORDS Metadata, macro language, CALL EXECUTE, %NRSTR, %TSLIT

Using PROC FCMP to the Fullest: Getting Started and Doing More

Using PROC FCMP to the Fullest: Getting Started and Doing More

Advanced PROC REPORT: Getting Your Tables Connected Using Links

Arthur L. Carpenter California Occidental Consultants, Oceanside, California

Using SAS Enterprise Guide to Coax Your Excel Data In To SAS

PROC REPORT Basics: Getting Started with the Primary Statements

KEYWORDS ARRAY statement, DO loop, temporary arrays, MERGE statement, Hash Objects, Big Data, Brute force Techniques, PROC PHREG

DSCI 325: Handout 2 Getting Data into SAS Spring 2017

Using GSUBMIT command to customize the interface in SAS Xin Wang, Fountain Medical Technology Co., ltd, Nanjing, China

Are You Missing Out? Working with Missing Values to Make the Most of What is not There

Chapter 2: Getting Data Into SAS

Stat 302 Statistical Software and Its Applications SAS: Data I/O

title1 "Visits at &string1"; proc print data=hospitalvisits; where sitecode="&string1";

Using ANNOTATE MACROS as Shortcuts

Using an ICPSR set-up file to create a SAS dataset

SAS2VBA2SAS: Automated solution to string truncation in PROC IMPORT Amarnath Vijayarangan, Genpact, India

Table Lookups: From IF-THEN to Key-Indexing

Job Security: Using the SAS Macro Language to Full Advantage

SAS/ACCESS Interface to PC Files for SAS Viya : Reference

Using the SQL Editor. Overview CHAPTER 11

SAS 101. Based on Learning SAS by Example: A Programmer s Guide Chapter 21, 22, & 23. By Tasha Chapman, Oregon Health Authority

Template Versatility Using SAS Macro Language to Generate Dynamic RTF Reports Patrick Leon, MPH

ET01. LIBNAME libref <engine-name> <physical-file-name> <libname-options>; <SAS Code> LIBNAME libref CLEAR;

Writing Programs in SAS Data I/O in SAS

Using Dynamic Data Exchange

Beginner Beware: Hidden Hazards in SAS Coding

INTRODUCTION TO SAS HOW SAS WORKS READING RAW DATA INTO SAS

Stat 302 Statistical Software and Its Applications SAS: Data I/O & Descriptive Statistics

SAS/ACCESS Interface to PC Files for SAS Viya 3.2: Reference

Moving Data and Results Between SAS and Excel. Harry Droogendyk Stratia Consulting Inc.

Other Data Sources SAS can read data from a variety of sources:

Lab 1. Introduction to R & SAS. R is free, open-source software. Get it here:

SAS/ACCESS 9.2. Interface to PC Files Reference. SAS Documentation

The Programmer's Solution to the Import/Export Wizard

Getting Up to Speed with PROC REPORT Kimberly LeBouton, K.J.L. Computing, Rossmoor, CA

WKn Chapter. Note to UNIX and OS/390 Users. Import/Export Facility CHAPTER 9

Lecture 1 Getting Started with SAS

Innovative SAS Techniques

BI-09 Using Enterprise Guide Effectively Tom Miron, Systems Seminar Consultants, Madison, WI

Conditional Processing Using the Case Expression in PROC SQL

Introduction / Overview

Use That SAP to Write Your Code Sandra Minjoe, Genentech, Inc., South San Francisco, CA

Demystifying PROC SQL Join Algorithms

Anticipating User Issues with Macros

JMP Coders 101 Insights

Arthur L. Carpenter California Occidental Consultants

Let Hash SUMINC Count For You Joseph Hinson, Accenture Life Sciences, Berwyn, PA, USA

2. Click File and then select Import from the menu above the toolbar. 3. From the Import window click the Create File to Import button:

Dictionary.coumns is your friend while appending or moving data

So, Your Data are in Excel! Ed Heaton, Westat

April 4, SAS General Introduction

Conditional Processing in the SAS Software by Example

A Practical Introduction to SAS Data Integration Studio

Building Intelligent Macros: Using Metadata Functions with the SAS Macro Language Arthur L. Carpenter California Occidental Consultants, Anchorage, AK

SAS Graphics & Code. stat 480 Heike Hofmann

STAT 7000: Experimental Statistics I

SAS Drug Development Program Portability

CMISS the SAS Function You May Have Been MISSING Mira Shapiro, Analytic Designers LLC, Bethesda, MD

SAS PROGRAMMING AND APPLICATIONS (STAT 5110/6110): FALL 2015 Module 2

Guide Users along Information Pathways and Surf through the Data

APPENDIX 4 Migrating from QMF to SAS/ ASSIST Software. Each of these steps can be executed independently.

Lab #9: ANOVA and TUKEY tests

DSCI 325: Handout 3 Creating and Redefining Variable in SAS Spring 2017

Developing SAS Studio Repositories

Amie Bissonett, inventiv Health Clinical, Minneapolis, MN

TS04. Running OpenCDISC from SAS. Mark Crangle

Biostatistics & SAS programming. Kevin Zhang

SAS Macros of Performing Look-Ahead and Look-Back Reads

Base and Advance SAS

Accessing Data and Creating Data Structures. SAS Global Certification Webinar Series

BIOMETRICS INFORMATION

The correct bibliographic citation for this manual is as follows: SAS Institute Inc Proc EXPLODE. Cary, NC: SAS Institute Inc.

Using Macro Functions

Easing into Data Exploration, Reporting, and Analytics Using SAS Enterprise Guide

Table Lookup Techniques: From the Basics to the Innovative

Level I: Getting comfortable with my data in SAS. Descriptive Statistics

Automated Checking Of Multiple Files Kathyayini Tappeta, Percept Pharma Services, Bridgewater, NJ

A SAS Macro Utility to Modify and Validate RTF Outputs for Regional Analyses Jagan Mohan Achi, PPD, Austin, TX Joshua N. Winters, PPD, Rochester, NY

Simple Rules to Remember When Working with Indexes

SAS/FSP 9.2. Procedures Guide

Connect with SAS Professionals Around the World with LinkedIn and sascommunity.org

How to Create Data-Driven Lists

Taming a Spreadsheet Importation Monster

Introducing a Colorful Proc Tabulate Ben Cochran, The Bedford Group, Raleigh, NC

DATA Step in SAS Viya : Essential New Features

One SAS To Rule Them All

SUGI 29 Data Warehousing, Management and Quality

Preserving your SAS Environment in a Non-Persistent World. A Detailed Guide to PROC PRESENV. Steven Gross, Wells Fargo, Irving, TX

Choosing the Right Tool from Your SAS and Microsoft Excel Tool Belt

Paper Phil Mason, Wood Street Consultants

Producing Summary Tables in SAS Enterprise Guide

Exchanging data between SAS and Microsoft Excel

Building and Using User Defined Formats

Importing and Updating Product Data using Import Wizard

PLA YING WITH MACROS: TAKE THE WORK OUT OF LEARNING TO DO MACROS. Arthur L. Carpenter

The DATA Statement: Efficiency Techniques

KV-SS090. Operating Instructions. Instant Scanning Software. Model No.

Eventus Example Series Using Non-CRSP Data in Eventus 7 1

Transcription:

PharmaSUG 2017 QT02 Importing CSV Data to All Character Variables Arthur L. Carpenter California Occidental Consultants, Anchorage, AK ABSTRACT Have you ever needed to import data from a CSV file and found that some of the variables have been incorrectly assigned to be numeric? When this happens to us we may lose information and our data may be incomplete. When using PROC IMPORT on an EXCEL file we can avoid this problem by specifying the MIXED=YES option to force all the variables to be character. This option is not available when using IMPORT to read a CSV file. Increasing GUESSINGROWS can help, but what if it is not sufficient? It is possible to force IMPORT to only create character variables. Although there is no option to do this, you can create a process that only creates character variables, and the process is easily automated so that no intervention is required on the user s part. KEYWORDS PROC IMPORT, CSV, Character variables, MIXED= option INTRODUCTION Within SAS there are a number of ways to import CSV data. The Import Wizard will build a PROC IMPORT step for you, you can write your own PROC IMPORT step, or you can write your own DATA step using the INPUT statement to control how the data are to be read. The DATA step requires the user to have a certain knowledge of the incoming data. Variable attributes such as name, type, and length among others must be specified to create a DATA step that reads the data successfully. While the use of the DATA step is very powerful, the user is required to know something about the data itself before coding the DATA step. This necessity for prior knowledge means that this approach is less flexible when it is applied to multiple data sets or to data sets with unknown variable attributes. The IMPORT procedure has the ability to self-determine some of the data set s variable attributes during execution. It scans the data to determine the variable names, and whether they are character or numeric, and if character their length. In order to make these determinations all or part of the data are read twice. The first scanning pass is used to determine the data characteristics (variable names and attributes), and then a second pass builds the data set itself. During the first scanning pass SAS only reads the first few rows of the incoming data, and the number of rows to be scanned is specified by the GUESSINGROWS option. When GUESSINGROWS is small, fewer resources are expended, but the possibility of incorrect determinations, such as a character field being assigned to a numeric variable, is increased. One solution that avoids the problem of miss-assignment of character fields into numeric variables, and the subsequent loss of information is to force all variables to be character. When reading Excel files using PROC IMPORT this is easily accomplished by using the MIXED=YES option. Unfortunately that option does not exist when using IMPORT to read CSV files. The process outlined below provides the ease and flexibility of PROC IMPORT while forcing all fields into character variables. THE APPROACH The technique outlined here is quite simple and is easily automated for any data coming into SAS from a CSV file using PROC IMPORT. Although the examples shown here assume that the CSV file has a header record containing the column 1

names this is generally not a necessary requirement, and minor adjustments to the code would accommodate files without a column name header record. The three basic steps discussed in this paper are: 1) Create a new observation and insert a neutral non-numeric character into each field 2) Read the data using PROC IMPORT 3) Remove the neutral character EXAMPLE DATA The data used in this example has five columns, with one column taking on only blank values. The other columns are a mixture of numeric and character variables. We need all columns to be read in as character. This is the view of the CSV file as rendered by Excel. When the CSV file is viewed using a text editor, we can more easily visualize the actual structure of the data that we will be manipulating. clinicnumber,notused,clinicname,region,death INSERTING A NEUTRAL CHARACTER The technique described in this paper works by adding a new data row and by inserting an asterisk ( * ) in each data column on that new data row. This is done before the file is seen by PROC IMPORT, and requires an additional pass of the data. We read the first row which contains the names of the variables. These are counted and a series of asterisks are written on the first data line following the variable clinicnumber,notused,clinicname,region,death names. The new data file will now have an *,*,*,*,* additional row of data; a row of comma separated asterisks is now immediately following the row of column names. The asterisks are inserted through the use of a DATA step, which reads the incoming CSV file and immediately rewrites it to a temporary file. 2

The series of inserted asterisks forces PROC IMPORT to see each column as a character variable, but because each individual asterisk has a length of 1, it cannot change or determine the variable s actual length. * HOLDIT is a temporary location; filename holdit temp lrecl=32768; ➊ * Point to the csv file; filename rawcsv "YOUR PATH.csv" lrecl=32768; ➋ data _null_; file holdit; infile rawcsv end=done; * Read the first observation (variable names); input; ➌ * Write the var names; put _infile_; ➍ * Count the variables (the names are comma separated); wcount = countw(_infile_,','); ➎ * Write an asterisk for each variable - forces each to be character; * but will not set the var length; string = catt('*',repeat(',*',wcount-2)); ➏ * Write the asterisks to the new data file; put string; ➐ * Read and write the data portion of the raw data; do until(done); ➑ input; ********** * Pre processing of the data could be done here **********; * Write the record; put _infile_; end; stop; filename rawcsv; ➒ ➊ A temporary file will be used to hold the augmented data. A large LRECL value makes sure that the file is wide enough. 32K is the widest possible length field. ➋ The incoming data is named. It does not matter that the LRECL may be oversized. ➌ The first line (with the column names) is read and stored in the automatic variable _INFILE_. ➍ The list of variable names is written into the temporary data file. ➎ The number of variables are counted. This assumes that each column is named. ➏ A series of comma separated asterisks are generated using the REPEAT function. ➐ The list of asterisks are written to the temporary data file. This will eventually become the first row of data values. The asterisks will force PROC IMPORT to assume that each column is character. 3

➑ A DO loop is used to read and write each of the remaining data lines to the temporary data file. The value of the temporary variable DONE is set to one when the last record is read from the incoming data file. DONE is created by the INFILE statement. ➒ The libref pointing to the raw data is cleared. The temporary data file can now be read using PROC IMPORT. READING THE DATA PROC IMPORT is then used to import the modified CSV file into a SAS data set. Although the GUESSINGROWS option is no longer needed to determine whether a variable is numeric or character, it is still used to determine the maximum length for each variable. The DATAROW=2 option is used to make sure that the row of asterisks is included in the read, but not the variable names. * Read the Altered CSV file into a SAS dataset; PROC IMPORT OUT= fromcsv ➊ DATAFILE= holdit ➋ DBMS=CSV REPLACE; guessingrows=444444; ➌ GETNAMES=YES; ➍ DATAROW=2; ➎ * Clear the temporary fileref; filename holdit; ➏ ➊ The new outgoing data set is named. ➋ The temporary file, with the asterisks, is to be used by IMPORT to form the new data set. ➌ GUESSINGROWS is set to an arbitrarily large number. ➍ The column names in the first row are to be used to form the variable names. ➎ The data starts in row 2. This is the row that contains the asterisks. ➏ The libref associated with the temporary file is cleared. REMOVING THE NEUTRAL CHARACTER The first time this new data set (WORK.FROMCSV) is used, the asterisks can be removed by using the FIRSTOBS data set data noasterisk; set fromcsv(firstobs=2); * Doing other stuff here.; option. Because the FIRSTOBS option specifies the first row of data that is to be read into the DATA step, it is easy to eliminate the asterisks which will always be in observation 1. The DATA step is then available to provide whatever other processing is to be done with this data set. FIRSTOBS can be used as either a data set option or as a system option. This means that the step that eliminates the observation containing the asterisks could be a procedure step as well as a DATA step. SUMMARY Although the MIXED=YES option is not available to the IMPORT procedure when reading CSV files, you can still force all variables to be character. By inserting a series of asterisks into the CSV file, the IMPORT procedure automatically assigns the variable type of character to each variable. 4

ABOUT THE AUTHOR Art Carpenter is a SAS Certified Advanced Professional Programmer and his publications list includes; five books and numerous papers and posters presented at SAS Global Forum, SUGI, PharmaSUG, WUSS, and other regional conferences. Art has been using SAS since 1977 and has served in various leadership positions in local, regional, and international user groups. Recent publications are listed on my sascommunity.org Presentation Index page. http://sascommunity.org/wiki/presentations:artcarpenter_papers_and_presentations AUTHOR CONTACT Arthur L. Carpenter California Occidental Consultants 10606 Ketch Circle Anchorage, AK 99515 (907) 865-9167 art@caloxy.com www.caloxy.com REFERENCES The technique shown in this paper was demonstrated in a sascommunity.org article and was highlighted on a sascommunity.org Tip of the Day. TRADEMARK INFORMATION SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. indicates USA registration. Other brand and product names are trademarks of their respective companies. 5