CMU MSP 36-601: SAS FORMATs and INFORMATs
Howard Seltman
October 15, 2017

1) Informats are programs that convert ASCII (Unicode) text to binary. Formats are programs that convert binary to text. Both come in two forms, character (which always has a $ in its name) and numeric, named for the binary data type. Common uses include reading and writing dates, storing character categories as numbers, displaying numeric codes as words, binning numeric variables, collapsing categories, and data checking on input.

2) Making your own INFORMATs: INFORMATs determine the relationship between the external file's text (in the input buffer) and the data value that INPUT places into the program data vector (PDV).

   a. Create the informat(s) using a PROC FORMAT, and then use them in an INFORMAT (or INPUT) statement in a DATA step.

   b. A PROC FORMAT can have one or more INVALUE statements that define the informat(s). (You can also use several PROC FORMATs.) Unless you follow the procedures to store it permanently, the INFORMAT will not be available in the future without re-running the PROC FORMAT.

   c. Restrictions: INFORMAT names must start with $ if the stored (PDV, binary) value is a string, can be up to 31 characters long, and must NOT end with a number. Do not include the final "." when defining the INFORMAT.

   d. Syntax:

         INVALUE myinfmtname [(myoptions)]
            myvalrange[, myvalrange ...]=myvalue
            [myvalrange[, myvalrange ...]=myvalue ...];

      The left side of the equals sign is what is looked for in the input text, and the right side is what is stored in the PDV (with numbers converted to 8 byte internal format). Any input text that does not match any myvalrange is converted as if there were no INFORMAT (i.e., using the default INFORMAT).

      The most useful option is (UPCASE), which converts data to upper case before comparing it to a myvalrange. Parentheses are needed!
      myvalrange is either a value such as 15 or 'Fred', or a range of values of the form 3-5, 'A'-'C', 3-<5 for [3,5), low-3 for (-∞, 3], or 120-high for [120, ∞), or the keyword OTHER (unquoted) to indicate all other values.

      myvalue is either a value whose type matches the stored data type, or an existing informat of the correct type inside square brackets [] to indicate processing by that informat. Also allowed is _ERROR_ to generate an error and _SAME_ for no conversion.

      For more details, see The FORMAT PROCEDURE: INVALUE Statement.

   e. Examples

      i.
         PROC FORMAT;
            INVALUE trial 'A'-<'N'=1 'N'-'Z'=2 1-3000=3 low-0=_error_;
         DATA temp;
            INFORMAT trial trial.;
            INPUT trial @@;
            DATALINES;
         B12 M17 O23 a23 1 789 1234 12345 2.4 -5
         ;

         Results are 1, 1, 2, . (with an error), 3, 3, 3, 12345, 3, . (with an error).

         Note: @@ means keep the rest of the input line for the next iteration of the implied DATA step loop.

      ii.
         PROC FORMAT; /* Note: the "$" is a prefix, not a suffix */
            INVALUE $gendfmt 1='M' 2='F' OTHER=_ERROR_;
         DATA temp;
            INFORMAT gender $gendfmt.;
            INPUT gender @@;
            DATALINES;
         1 2 2 1 . 0 M F
         ;

         Results are M, F, F, M, ., ., ., . (with errors for the missing values).

         PROC FORMAT;
            INVALUE $gend2fmt 1,M=M 2,F=F "."=" " OTHER=_ERROR_;
         DATA temp;
            INFORMAT gender $gend2fmt.;
            INPUT gender @@;
            DATALINES;
         1 2 2 1 . 0 M F
         ;

         Results are M, F, F, M, (blank), (blank), M, F with an error message for the 0 only. Note that a value of all spaces is the missing value for strings.

      iii.
         PROC FORMAT;
            INVALUE mytfformat T,True,TRUE,true=1 F,False,FALSE,false=0 OTHER=_ERROR_;
            INVALUE mytfuformat (UPCASE) T,TRUE=1 F,FALSE=0 OTHER=_ERROR_;
         DATA junk;
            INFILE DATALINES;
            INFORMAT num mytfformat. numu mytfuformat.;
            INPUT num numu @@;
            DATALINES;
         True True TRUE TRUE false false FALSE FALSE T T 0 0
         ;

         Results are 1, 1, 0, 0, 1, . for both variables.

      iv. In the above syntax ("list input" format) the informats on the INFORMAT line must be present (via re-running PROC FORMAT or using permanent (in)formats) to use the data set in a SET statement in the future. The "modified list input" format shown here does not require the INFORMAT to be known for future use in a SET statement.

         LIBNAME here ".";
         DATA here.junk;
            INFILE DATALINES;
            INPUT num : mytfformat. @@;
            DATALINES;
         True TRUE false FALSE T 0
         ;

         PROC FORMAT;
            INVALUE $fuel G=Gas O=Oil S=Solar X="Other fuels" OTHER=_ERROR_;
         DATA here.crap;
            INPUT type : $fuel. Amount @@;   /* character informat: the $ belongs to the informat name */
            CARDS;
         G 10 O 12.3 X 11 S 9 O 21
         ;

   f. Note that there is a way to create format ranges from the fields in a dataset (see below).
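The _SAME_ and _ERROR_ keywords described in 2d can be combined for the "data checking on input" use mentioned in section 1. The following is a minimal sketch, not part of the original handout: the informat name agechk, the data set name checked, and the 0-120 range are all hypothetical choices made for illustration.

```sas
/* Hypothetical range-check informat: keep ages 0-120 unchanged, */
/* and treat anything else as invalid data.                      */
PROC FORMAT;
   INVALUE agechk 0-120=_SAME_ OTHER=_ERROR_;
RUN;

DATA checked;
   INPUT age : agechk. @@;   /* modified list input, as in example iv */
   DATALINES;
34 0 120 -2 250
;

PROC PRINT DATA=checked;
RUN;
```

Under these assumptions the in-range values should be stored as read, while the out-of-range values should become missing and be reported as invalid data in the log.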

3) Making your own formats

   a. Create formats for output using PROC FORMAT. Use them in DATA steps to make them defaults for some variables. Use them in PROC steps to invoke them temporarily. This is usually clearer and safer than IF/THEN in DATA steps.

   b. A PROC FORMAT can have one (or more) VALUE statements that define the format(s). (Or you can use several PROC FORMATs.) Unless you follow the procedures to store it permanently, the FORMAT will not be available in the future without re-running the PROC FORMAT, which can cause a problem for permanent data sets.

   c. Restrictions: FORMAT names must start with $ if the stored data type (PDV, binary) is a string, and can be up to 32 characters long. Do not include the final "." when defining the FORMAT.

   d. Syntax:

         VALUE myfmtname
            myvalrange[, myvalrange ...]=myvalue
            [myvalrange[, myvalrange ...]=myvalue ...];

      The left side of the equals sign is what is looked for in the data, and the right side is what is output instead. Any data value that does not match any myvalrange is output as if there were no FORMAT (i.e., using the default FORMAT).

      myvalrange is either a value such as 15 or 'Fred', or a range of values of the form 3-5 or 'A'-'C' (it must match the format/data type), or the keyword OTHER (unquoted).

      myvalue is either a string value, or an existing format of the correct type inside square brackets []. If it is a format, that format is used to create the result.

      For more details see The FORMAT PROCEDURE: VALUE Statement.

   e. Examples

      i.
         PROC FORMAT;
            VALUE $fuel G=Gas O=Oil S=Solar X="Other fuels" OTHER="Unknown type";
         DATA craps;
            INPUT type : $1. Amount @@;
            FORMAT type $fuel.;
            CARDS;
         G 10 O 12.3 X 11 S 9 N 21
         ;

      ii.
         PROC FORMAT;
            VALUE fuel 0-1=Gas 2=Oil 3=Solar 4-9="Other fuels" .=missing other=error;
         DATA crapn;
            INPUT type Amount @@;
            FORMAT type fuel.;
            CARDS;
         1 10 2 12.3 5 11 . 9 11 21
         ;
         PROC MEANS DATA=crapN;
            /* WHERE type = "Gas"; is an ERROR because type is stored as a number */
            WHERE PUT(type, fuel.) = "Gas";
            VAR amount;
         RUN;

      iii.
         PROC FORMAT;
            VALUE ageranges low-<18 = "Minor"
                            18-<45 = "Young adult"
                            45-<60 = "Middle age"
                            60-high = "Elderly";
         PROC FREQ DATA=ageData;
            TABLES age / MISSING;
            FORMAT age ageranges.;
         RUN;

      iv.
         PROC FORMAT;
            VALUE pval 0-<0.005='<0.005' OTHER=[5.3];
         DATA pvals;
            INPUT p @@;
            FORMAT p pval.;
            DATALINES;
         0.23 0.05 0.006 0.005 0.00001
         ;
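Point 3a distinguishes a FORMAT statement in a DATA step (which becomes the stored default for the variable) from a FORMAT statement in a PROC step (which applies only to that step). A minimal sketch of the difference, assuming a hypothetical format yesno and data set survey that are not part of the handout:

```sas
PROC FORMAT;
   VALUE yesno 1="Yes" 0="No";   /* hypothetical format for illustration */
RUN;

DATA survey;
   INPUT id answer @@;
   FORMAT answer yesno.;         /* stored with the data set as the default */
   DATALINES;
1 1 2 0 3 1
;

PROC PRINT DATA=survey;          /* uses the stored default, yesno. */
RUN;

PROC PRINT DATA=survey;
   FORMAT answer;                /* no format named: removes it for this step only */
RUN;
```

The stored values of answer remain the numbers 1 and 0 in both cases; only the displayed text changes.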

4) Using your formats and informats to create new variables

   a. The PUT() function expresses a variable using a FORMAT. E.g., continuing the example from above, when type=5 then PUT(type, fuel.) returns the string value "Other fuels". The return value of PUT() is always a string, and the argument must match the $ in the format name.

   b. The INPUT() function uses an INFORMAT to do the equivalent of reading data from a plain text file, but using a variable for the input instead. INPUT("$12,123.45", comma10.2) returns 12123.45. The first argument of INPUT() is always a string and the return value matches the $ in the informat name.

   c. Example: longtype becomes the long string version of type. Neither a LENGTH statement nor a $ is needed, because that information (the length of the longest label) is stored in the format.

         DATA crap2;
            SET craps;
            longtype = PUT(type, $fuel.);   /* type is character in craps, so use $fuel. */

   d. Example: numtype becomes the numeric version of type.

         PROC FORMAT;
            VALUE $numfuel G="1,000" O="2,000" S="3,000" X="4" OTHER="-1";
         DATA crap3;
            SET craps;
            numtype = INPUT(PUT(type, $numfuel.), COMMA7.);

   e. Alternate date formats

         DATA dates(drop=s);
            LENGTH s $11;
            INPUT s @@;
            /* A letter in position 4 means the dd-Mon-yyyy form */
            IF UPCASE(SUBSTR(s, 4, 1)) >= 'A' & UPCASE(SUBSTR(s, 4, 1)) <= 'Z'
               THEN date = INPUT(s, DATE11.);
               ELSE date = INPUT(s, MMDDYY10.);
            FORMAT date DATE11.;
            DATALINES;
         1/25/2012 12/2/2016 25-Jan-2012 2-Dec-2016
         ;
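The same PUT() technique covers the "binning numeric variables" use listed in section 1. The sketch below is an illustration rather than part of the handout: it repeats the ageranges format from 3e so that it is self-contained, and the data set name binned and variable name agegroup are hypothetical.

```sas
PROC FORMAT;
   VALUE ageranges low-<18 = "Minor"
                   18-<45 = "Young adult"
                   45-<60 = "Middle age"
                   60-high = "Elderly";
RUN;

DATA binned;
   INPUT age @@;
   agegroup = PUT(age, ageranges.);   /* new character variable holding the bin label */
   DATALINES;
5 18 44.9 45 60 75
;

PROC FREQ DATA=binned;
   TABLES agegroup;
RUN;
```

Unlike attaching the format with a FORMAT statement, this stores the label itself, so agegroup can be used directly in WHERE clauses or BY groups.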

5) Permanent (IN)FORMATs

   a. Permanent (IN)FORMATs are better for data management, precluding the need to re-run PROC FORMATs each time you use your data.

   b. Step 1: Use PROC FORMAT LIBRARY=myLibRef; when creating the (IN)FORMAT.

   c. Step 2: Use OPTIONS FMTSEARCH=(myLibRef); in current and future sessions before any steps that reference the formats. (Don't forget the parentheses.)

   d. Not doing this for FORMATs included in DATA steps that create permanent data sets will prevent the data set from being used in the future unless you manually re-run the PROC FORMAT.

   e. Not doing this for INFORMATs included in the DATA steps that create permanent data sets will prevent the data set from being the target of a SET statement in a DATA step in the future unless you manually re-run the PROC FORMAT.

   f. Example of permanent (IN)FORMAT creation code:

         LIBNAME heart "heartdata";
         FILENAME hdata "heartdata/hearts.txt";
         /* Special formats for reading/writing lab data */
         PROC FORMAT LIBRARY=heart;
            INVALUE readlab...;
            VALUE labfmt...;
         OPTIONS FMTSEARCH=(heart);
         /* create permanent data set */
         DATA heart.hrtstudy;
            INFILE hdata;
            INFORMAT date DDMMYY8.;
            INFORMAT lab1 lab2 readlab.;
            INPUT id $ date lab1 lab2 outcome;
            FORMAT date DATE11.;
            FORMAT lab1 lab2 labfmt.;

   g. Example of permanent (IN)FORMAT data use code:

         LIBNAME study "heartdata";
         OPTIONS FMTSEARCH=(study);
         DATA temp;
            SET study.hrtstudy;
            labratio = lab1/lab2;
         /* Analyze */
         PROC REG data=temp;
            MODEL outcome = labratio;
         RUN;
         QUIT; /* because REG is an interactive PROC */
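To confirm what has actually been stored in a permanent library, the FMTLIB option of PROC FORMAT prints the definitions held in the library's format catalog. A minimal sketch, assuming a hypothetical libref demo pointing at the current folder and a hypothetical format yn (neither appears in the handout):

```sas
LIBNAME demo ".";                   /* hypothetical libref: the current folder */

PROC FORMAT LIBRARY=demo;           /* Step 1: store the format permanently */
   VALUE yn 1="Yes" 0="No";
RUN;

OPTIONS FMTSEARCH=(demo);           /* Step 2: tell SAS where to look */

PROC FORMAT LIBRARY=demo FMTLIB;    /* list the stored (in)format definitions */
RUN;
```

In a later session only the LIBNAME and OPTIONS statements should need to be repeated before the format can be used.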

6) INFORMATs from data sets

   a. The FORMAT procedure has an option CNTLIN=fmtDataSet for the PROC FORMAT statement, which reads the formatting information from the specified data set rather than from the body of the PROC.

   b. The data set supplies the formatting information using variables called FMTNAME, START, and LABEL, and possibly TYPE, END, SEXCL, and EEXCL (start exclude and end exclude).

   c. FMTNAME is the name of the format to be created. A given PROC FORMAT can create one or several formats using CNTLIN=. Each format typically consists of several data lines with the same FMTNAME.

   d. START specifies the value that is to be formatted, or START and END together specify a range of values. If both are included and a single value is to be formatted, both START and END must be set to that value. Typically, START is a string column, but if all values are numeric, a numeric type is allowed.

   e. LABEL contains the result of formatting, i.e., the value to be output. Typically, this column is a string, but if all values are numeric, numeric is allowed.

   f. If START and END are both supplied, you may include SEXCL and EEXCL (where S=start, E=end, and EXCL=exclude) and set each one to Y or N, where Y means the range does exclude the specified value and N means it does not.

   g. If the value to be formatted is a string, then either its FMTNAME must start with a $ or a TYPE column must be included. TYPE must be C for character (string) or N for numeric on every line (P for picture is also allowed).

   h. Example:

         DATA myfmts;
            LENGTH FMTNAME $5 START $1;
            INPUT FMTNAME START LABEL $15.;
            DATALINES;
         foods 1 broccoli
         foods 2 tomatoes
         foods 3 brussel sprouts
         foods 4 chicken
         foods .
         ;
         PROC FORMAT CNTLIN=myFmts;
         DATA myrecipes;
            LENGTH name $9;
            INPUT name ingr1 ingr2 ingr3;
            LABEL ingr1="ingredient 1" ingr2="ingredient 2" ingr3="ingredient 3";
            FORMAT ingr1-ingr3 foods.;
            DATALINES;
         ChickenBS 4 3 .
         BrocTom 1 2 .
         Veg3 1 2 3
         ChickTom 2 4 .
         ;
         PROC PRINT DATA=myRecipes;
         RUN;
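The example above uses single values only. The END, SEXCL, and EEXCL variables described in 6d and 6f allow ranges as well. The following is a sketch under stated assumptions: the format name grade, the data set names gradectl and scores, and the cut points are hypothetical and chosen only to illustrate half-open ranges.

```sas
/* Hypothetical control data set: one row per range.               */
/* SEXCL=N, EEXCL=Y gives [START, END); the last row is [90, 100]. */
DATA gradectl;
   LENGTH FMTNAME $8 TYPE $1 START END $3 SEXCL EEXCL $1;
   INPUT FMTNAME TYPE START END SEXCL EEXCL LABEL $;
   DATALINES;
grade N 0 60 N Y F
grade N 60 70 N Y D
grade N 70 80 N Y C
grade N 80 90 N Y B
grade N 90 100 N N A
;

PROC FORMAT CNTLIN=gradectl;
RUN;

DATA scores;
   INPUT score @@;
   letter = PUT(score, grade.);   /* apply the data-driven format */
   DATALINES;
59.9 60 89 90 100
;

PROC PRINT DATA=scores;
RUN;
```

Because the ranges live in a data set, the cut points can be changed or generated by another program without editing any PROC FORMAT code.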