CMU MSP 36-601: SAS FORMATs and INFORMATs
Howard Seltman
October 15, 2017

1) Informats are programs that convert ASCII (Unicode) text to binary. Formats are programs that convert binary to text. Both come in two forms, character (which always has a $ in its name) and numeric, named for the binary data type. Common uses include reading and writing dates, storing character categories as numbers, displaying numeric codes as words, binning numeric variables, collapsing categories, and data checking on input.

2) Making your own INFORMATs: INFORMATs determine the relationship between the external file's text (in the input buffer) and the data value that INPUT places into the program data vector (PDV).

a. Create the informat(s) using a PROC FORMAT, and then use them in an INFORMAT (or INPUT) statement in a DATA step.

b. A PROC FORMAT can have one or more INVALUE statements that define the informat(s). (You can also use several PROC FORMATs.) Unless you follow the procedures to store it permanently, the INFORMAT will not be available in the future without re-running the PROC FORMAT.

c. Restrictions: INFORMAT names must start with $ if the stored (PDV, binary) value is a string, can be up to 31 characters long, and must NOT end with a number. Do not include the final period (.) when defining the INFORMAT.

d. Syntax:
   INVALUE myinfmtname [(myoptions)]
     myvalrange[, myvalrange ...]=myvalue
     [myvalrange[, myvalrange ...]=myvalue ...];
   The left side of the equals sign is what is looked for in the input text, and the right side is what is stored in the PDV (with numbers converted to 8-byte internal format). Any input text that does not match any myvalrange is converted as if there were no INFORMAT (i.e., using the default INFORMAT). The most useful option is (UPCASE), which converts the input text to upper case before comparing it to a myvalrange. The parentheses are required!
   myvalrange is either a value such as 15 or 'Fred', or it is a range of values of the form 3-5, 'A'-'C', 3-<5 for [3,5), low-3 for (-∞, 3], or 120-high for [120, ∞), or the keyword OTHER (unquoted) to indicate all other values. myvalue is either a value whose type matches the stored data type, or it is an existing informat of the correct type inside square brackets [ ] to indicate processing by that informat. Also allowed are _ERROR_ to generate an error and _SAME_ for no conversion. For more details, see The FORMAT PROCEDURE: INVALUE Statement.

e. Examples

i. PROC FORMAT;
     INVALUE trial 'A'-<'N'=1 'N'-'Z'=2 1-3000=3 low-0=_error_;
   DATA temp;
     INFORMAT trial trial.;
     INPUT trial @@;
     CARDS;
   B12 M17 O23 a23 1 789 1234 12345 2.4 -5
   ;
   Results are 1, 1, 2, . (with an error), 3, 3, 3, 12345, 3, . (with an error).
   Note: @@ means keep the rest of the input line for the next iteration of the implied DATA step loop.

ii. PROC FORMAT;
     /* Note: the "$" is a prefix, not a suffix */
     INVALUE $gendfmt 1='M' 2='F' OTHER=_ERROR_;
   DATA temp;
     INFORMAT gender $gendfmt.;
     INPUT gender @@;
     CARDS;
   1 2 2 1 . 0 M F
   ;
   Results are M, F, F, M, ., ., ., . (with errors for the missing values).

   PROC FORMAT;
     INVALUE $gend2fmt 1,M=M 2,F=F "."=" " OTHER=_ERROR_;
   DATA temp;
     INFORMAT gender $gend2fmt.;
     INPUT gender @@;
     CARDS;
   1 2 2 1 . 0 M F
   ;
   Results are M, F, F, M, (blank), (blank), M, F, with an error message for the 0 only.
   Note: a value consisting entirely of spaces is the missing value for strings.
iii. PROC FORMAT;
     INVALUE mytfformat T,True,TRUE,true=1 F,False,FALSE,false=0 OTHER=_ERROR_;
     INVALUE mytfuformat (UPCASE) T,TRUE=1 F,FALSE=0 OTHER=_ERROR_;
   DATA junk;
     INFILE CARDS;  /* assumption: the source omits the INFILE target; in-stream data is used here */
     INFORMAT num mytfformat. numu mytfuformat.;
     INPUT num numu @@;
     CARDS;
   True True TRUE TRUE false false FALSE FALSE T T 0 0
   ;
   Results are 1, 1, 0, 0, 1, . for both variables.

iv. In the above syntax ("list input" format), the informats on the INFORMAT line must be present (via re-running PROC FORMAT or using permanent (in)formats) to use the data set in a SET statement in the future. The "modified list input" format shown here, where the informat follows a colon in the INPUT statement, does not require the INFORMAT to be known for future use in a SET statement.
   LIBNAME here ".";
   DATA here.junk;
     INFILE CARDS;  /* assumption: the source omits the INFILE target; in-stream data is used here */
     INPUT num : mytfformat. @@;
     CARDS;
   True TRUE false FALSE T 0
   ;

   PROC FORMAT;
     INVALUE $fuel G=Gas O=Oil S=Solar X="Other fuels" OTHER=_ERROR_;
   DATA here.crap;
     INPUT type : $fuel. Amount @@;  /* the $ belongs to the informat name */
     CARDS;
   G 10 O 12.3 X 11 S 9 O 21
   ;

f. Note that there is a way to create format ranges from the fields in a data set (see section 6 below).
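The _SAME_ and _ERROR_ keywords from item d are not used together in the examples above. A minimal sketch of an informat for data checking on input follows; the informat name score, the data set name scores, and the data values are made up for illustration.

```sas
PROC FORMAT;
  /* hypothetical informat: keep 0-100 as read, treat NA as missing,
     and flag anything else as invalid data */
  INVALUE score (UPCASE)
    'NA'  = .
    0-100 = _SAME_
    OTHER = _ERROR_;
RUN;

DATA scores;
  INPUT score : score. @@;   /* modified list input, as in example iv */
  CARDS;
87 na 101 42.5
;
RUN;

PROC PRINT DATA=scores;
RUN;
```

Under these assumptions, 87 and 42.5 should be stored unchanged, na should be stored as missing without complaint, and 101 should be stored as missing with an invalid-data note in the log.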
3) Making your own formats

a. Create formats for output using PROC FORMAT. Use them in DATA steps to make them defaults for some variables. Use them in PROC steps to invoke them temporarily. This is usually clearer and safer than IF/THEN in DATA steps.

b. A PROC FORMAT can have one (or more) VALUE statements that define the format(s). (Or you can use several PROC FORMATs.) Unless you follow the procedures to store it permanently, the FORMAT will not be available in the future without re-running the PROC FORMAT, which can cause a problem for permanent data sets.

c. Restrictions: FORMAT names must start with $ if the stored data type (PDV, binary) is a string, and can be up to 32 characters long. Do not include the final period (.) when defining the FORMAT.

d. Syntax:
   VALUE myfmtname
     myvalrange[, myvalrange ...]=myvalue
     [myvalrange[, myvalrange ...]=myvalue ...];
   The left side of the equals sign is what is looked for in the data, and the right side is what is output instead. Any data value that does not match any myvalrange is output as if there were no FORMAT (i.e., using the default FORMAT).
   myvalrange is either a value such as 15 or 'Fred', or a range of values of the form 3-5 or 'A'-'C' (in both cases it must match the data type of the format), or it is the keyword OTHER (unquoted). myvalue is either a string value, or it is an existing format of the correct type inside square brackets [ ]. If it is a format, that format is used to create the result. For more details, see The FORMAT PROCEDURE: VALUE Statement.

e. Examples

i. PROC FORMAT;
     VALUE $fuel G=Gas O=Oil S=Solar X="Other fuels" OTHER="Unknown type";
   DATA craps;
     INPUT type : $1. Amount @@;
     FORMAT type $fuel.;
     CARDS;
   G 10 O 12.3 X 11 S 9 N 21
   ;
ii. PROC FORMAT;
     VALUE fuel 0-1=Gas 2=Oil 3=Solar 4-9="Other fuels" .=missing other=error;
   DATA crapn;
     INPUT type Amount @@;
     FORMAT type fuel.;
     CARDS;
   1 10 2 12.3 5 11 . 9 11 21
   ;
   PROC MEANS DATA=crapN;
     /* WHERE type = "Gas"; is an ERROR because type is stored as a number */
     WHERE PUT(type, fuel.) = "Gas";
     VAR amount;

iii. PROC FORMAT;
     VALUE ageranges low-<18 = "Minor"
                     18-<45 = "Young adult"
                     45-<60 = "Middle age"
                     60-high = "Elderly ";
   PROC FREQ DATA=ageData;
     TABLES age / MISSING;
     FORMAT age ageranges.;

iv. PROC FORMAT;
     VALUE pval 0-<0.005='<0.005' OTHER=[5.3];
   DATA pvals;
     INPUT p @@;
     FORMAT p pval.;
     CARDS;
   0.23 0.05 0.006 0.005 0.00001
   ;
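Item a says that a FORMAT statement inside a PROC step applies only temporarily. A small sketch of that behavior, used here to collapse categories, is shown below; the data set fuels, the format fuelgrp, and the grouping itself are hypothetical.

```sas
DATA fuels;
  INPUT type amount @@;
  CARDS;
1 10 2 12.3 3 9 5 11
;
RUN;

PROC FORMAT;
  /* hypothetical grouping of the numeric fuel codes */
  VALUE fuelgrp 0-2 = "Fossil" 3 = "Solar" 4-9 = "Other";
RUN;

PROC FREQ DATA=fuels;
  TABLES type;
  FORMAT type fuelgrp.;   /* in effect for this PROC only */
RUN;

PROC PRINT DATA=fuels;    /* no FORMAT statement here, so type prints as stored */
RUN;
```

PROC FREQ should tabulate the formatted groups rather than the individual codes, while the data set itself keeps the original numeric values and no permanent format.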
4) Using your formats and informats to create new variables

a. The PUT() function expresses a variable using a FORMAT. E.g., continuing the example from above, when type=5, PUT(type, fuel.) returns the string value "Other fuels". The return value of PUT() is always a string, and the type of the first argument must match the format name (character if the name starts with $, numeric otherwise).

b. The INPUT() function uses an INFORMAT to do the equivalent of reading data from a plain text file, but with a variable as the input instead. INPUT("$12,123.45", comma10.2) returns 12123.45. The first argument of INPUT() is always a string, and the type of the return value matches the informat name (character if the name starts with $, numeric otherwise).

c. Example: longtype becomes the long string version of type. Neither a LENGTH statement nor a $ is needed, because that information (a length of $11) is stored in the format.
   DATA crap2;
     SET crapn;  /* type is numeric in crapn, matching the numeric format fuel. */
     longtype = PUT(type, fuel.);

d. Example: numtype becomes the numeric version of type.
   PROC FORMAT;
     VALUE $numfuel G="1,000" O="2,000" S="3,000" X=4 OTHER="-1";
   DATA crap3;
     SET craps;
     numtype = INPUT(PUT(type, $numfuel.), COMMA7.);

e. Alternate date formats
   DATA dates(drop=s);
     LENGTH s $11;
     INPUT s @@;
     /* a letter in position 4 means the month is spelled out */
     IF UPCASE(SUBSTR(s, 4, 1)) >= 'A' & UPCASE(SUBSTR(s, 4, 1)) <= 'Z'
       THEN date = INPUT(s, DATE11.);
       ELSE date = INPUT(s, MMDDYY10.);
     FORMAT date DATE11.;
     CARDS;
   1/25/2012 12/2/2016 25-Jan-2012 2-Dec-2016
   ;
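PUT() and INPUT() work the same way with the formats and informats that SAS supplies. The following sketch, with made-up variable names, converts a number to a zero-padded string and back, and repeats the COMMA example from item b.

```sas
DATA _NULL_;
  idnum  = 42;
  idchar = PUT(idnum, Z5.);                  /* numeric to string, zero-padded to width 5 */
  back   = INPUT(idchar, 5.);                /* string back to numeric */
  amount = INPUT("$12,123.45", COMMA10.2);   /* example from item b */
  PUT idnum= idchar= back= amount=;          /* write the values to the log */
RUN;
```

The log line should show idchar as 00042, with back and amount holding ordinary numeric values.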
5) Permanent (IN)FORMATs

a. Permanent (IN)FORMATs are better for data management, precluding the need to re-run PROC FORMATs each time you use your data.

b. Step 1: Use PROC FORMAT LIBRARY=myLibRef; when creating the (IN)FORMAT.

c. Step 2: Use OPTIONS FMTSEARCH=(myLibRef); in current and future sessions before any steps that reference the formats. (Don't forget the parentheses.)

d. Not doing this for FORMATs included in DATA steps that create permanent data sets will prevent the data set from being used in the future unless you manually re-run the PROC FORMAT.

e. Not doing this for INFORMATs included in the DATA steps that create permanent data sets will prevent the data set from being the target of a SET statement in a DATA step in the future unless you manually re-run the PROC FORMAT.

f. Example of permanent (IN)FORMAT creation code:
   LIBNAME heart "heartdata";
   FILENAME hdata "heartdata/hearts.txt";
   /* Special formats for reading/writing lab data */
   PROC FORMAT LIBRARY=heart;
     INVALUE readlab...;
     VALUE labfmt...;
   OPTIONS FMTSEARCH=(heart);
   /* create permanent data set */
   DATA heart.hrtstudy;
     INFILE hdata;
     INPUT id$ date lab1 lab2 outcome;
     INFORMAT date DDMMYY8.;
     INFORMAT lab1 lab2 readlab.;
     FORMAT date DATE11.;
     FORMAT lab1 lab2 labfmt.;

g. Example of permanent (IN)FORMAT data use code:
   LIBNAME study "heartdata";
   OPTIONS FMTSEARCH=(study);
   DATA temp;
     SET study.hrtstudy;
     labratio = lab1/lab2;
   /* Analyze */
   PROC REG data=temp;
     MODEL outcome = labratio;
   QUIT; /* because REG is an interactive PROC */
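To check what has actually been stored, the FMTLIB option of PROC FORMAT lists the contents of a format catalog. The sketch below reuses the study libref from example g and assumes the catalog was created as in example f.

```sas
LIBNAME study "heartdata";
OPTIONS FMTSEARCH=(study);

/* List every format and informat stored in the study library's catalog */
PROC FORMAT LIBRARY=study FMTLIB;
RUN;

/* If a stored format cannot be found, this system option lets SAS fall back
   to default formats instead of stopping with an error */
OPTIONS NOFMTERR;
```

NOFMTERR is only a workaround for reading the data; the values then print in their raw stored form, so restoring the permanent formats is still the better fix.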
6) INFORMATs from data sets

a. The FORMAT procedure has an option CNTLIN=fmtDataSet for the PROC FORMAT statement, which reads the formatting information from the specified data set rather than from the body of the PROC.

b. The data set supplies the formatting information using variables called FMTNAME, START, and LABEL, and possibly TYPE, END, SEXCL, and EEXCL (start exclude and end exclude).

c. FMTNAME is the name of the format to be created. A given PROC FORMAT can create one or several formats using CNTLIN=. Each format typically consists of several data lines with the same FMTNAME.

d. START specifies the value that is to be formatted, or START and END together specify a range of values. If both are included and a single value is to be formatted, both START and END must be set to that value. Typically, START is a string column, but if all values are numeric, a numeric type is allowed.

e. LABEL contains the result of formatting, i.e., the value to be output. Typically, this column is a string, but if all values are numeric, numeric is allowed.

f. If START and END are both supplied, you may include SEXCL and EEXCL (where S=start, E=end, and EXCL=exclude) and set each one to Y or N, where Y means the range excludes the specified endpoint and N means it includes it.

g. If the value to be formatted is a string, then either its FMTNAME must start with a $ or a TYPE column must be included. TYPE must be C for character (string) or N for numeric on every line (P for picture is also allowed).
h. Example:
   DATA myfmts;
     LENGTH FMTNAME $5 START $1;
     INPUT FMTNAME START LABEL $15.;
     CARDS;
   foods 1 broccoli
   foods 2 tomatoes
   foods 3 brussel sprouts
   foods 4 chicken
   foods .
   ;
   PROC FORMAT CNTLIN=myFmts;
   DATA myrecipes;
     LENGTH name $9;
     INPUT name ingr1 ingr2 ingr3;
     LABEL ingr1="ingredient 1" ingr2="ingredient 2" ingr3="ingredient 3";
     FORMAT ingr1-ingr3 foods.;
     CARDS;
   ChickenBS 4 3 .
   BrocTom 1 2 .
   Veg3 1 2 3
   ChickTom 2 4 .
   ;
   PROC PRINT DATA=myRecipes;
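The example above uses only single values. A sketch of a range format built with START, END, and EEXCL (items d and f) follows; the data set agefmt, the format name agegrp, and the upper limit of 120 are assumptions made for illustration.

```sas
DATA agefmt;
  LENGTH FMTNAME $8 EEXCL $1 LABEL $12;
  RETAIN FMTNAME "agegrp";          /* hypothetical numeric format name */
  /* & lets LABEL contain single blanks; START and END are numeric */
  INPUT START END EEXCL LABEL & $12.;
  CARDS;
0 18 Y Minor
18 45 Y Young adult
45 60 Y Middle age
60 120 N Elderly
;
RUN;

PROC FORMAT CNTLIN=agefmt;
RUN;

DATA _NULL_;
  DO age = 17.9, 18, 59, 60;
    grp = PUT(age, agegrp.);
    PUT age= grp=;                  /* write each age and its group to the log */
  END;
RUN;
```

With EEXCL set to Y, the upper endpoint of each of the first three ranges is excluded, so 18 should fall in "Young adult" and 60 in "Elderly", mirroring the 18-<45 style ranges of the ageranges format in section 3.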