Title: Using SAS to determine file and space usage in UNIX
Author: Mike Montgomery [MIS Manager, MTN (South Africa)]

Abstract

The paper shows tools developed to manage a proliferation of SAS files and directories in a UNIX environment. The tools were used to determine how and by whom approximately 2 terabytes of disk space was being used, how the usage was likely to grow, and how much of the space was being occupied without being used. Benefits derived from the tools included:
- an objective basis for a possible internal charge-back policy for the use of disk space,
- a basis for an archiving policy,
- essential information for an overdue V6-V8 conversion plan,
- essential information for a data warehousing project.
The paper also shows techniques that extend the ODS HTML output beyond what is possible through its standard usage.

Background

When taking over responsibility for the MIS (Management Information Systems) department, I was presented with a huge and unstructured collection of business data that had grown over the seven years of the company's existence. This data is used to generate much of the reporting for the business. The SAS datasets, indexes and programs had accumulated during the rapid growth of a young company in the highly competitive telecommunications industry. This growth happened with less than adequate documentation.

The SAS environment consisted of approximately 2 terabytes of files (excluding files that had been moved offline). These files were created by several developers (some of whom have come and gone), and were scattered over many directories (exactly how many only became known after applying the utilities described in this paper). Using the utilities discussed, it has become possible to get a thorough understanding and much better control of the environment.

Summary of method

It was known which disks were used to store SAS files. The utility does the following:
- Determines the sub-directories under the root directory of each disk.
- For each directory and sub-directory, searches for files with selected extensions. The extensions used by SAS are provided in SAS documentation.
- For each file found, determines (through operating system commands) the owner of the file, when the file was created, when modified, when last used, and the size of the file.
- For each version 6 and version 8 dataset, determines details about the dataset and the variables involved (using PROC DATASETS and PROC CONTENTS).
- For each dataset, determines whether it is unique in terms of the combination of variables it contains. If it is not unique, determines what other datasets (having the same combination of variables) it can be grouped with.

The details gathered above were exported to a Windows NT SAS session, from where static HTML pages were created using the SAS Output Delivery System (ODS). The utility extended the use of the ODS to present output in ways that are not possible with the conventional use of the ODS. In particular, output from different and dissimilar procedures can be grouped together at the discretion of the developer and arranged to any level of nesting.

The data gathered from the UNIX environment is stored for month-on-month comparisons of the environment, and for estimating the future growth of the environment.
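The grouping of datasets by their combination of variables is not shown in the paper's extracts. The following is a minimal sketch of one way it could be done, not the utility's actual code. It assumes a dataset ALLVARS with one row per dataset/variable pair (for example, stacked PROC CONTENTS OUT= output); the names ALLVARS, DSNAME, VARNAME, SIGNATURE, GROUP and UNIQUE, and the sample variable names, are hypothetical.

```sas
/* Hypothetical input: one row per dataset/variable pair. */
data allvars;
  length dsname $ 32 varname $ 32;
  input dsname $ varname $;
  datalines;
MANY msisdn
MANY calldate
UC_DUPS calldate
UC_DUPS msisdn
OTHER region
;
run;

proc sort data=allvars;
  by dsname varname;
run;

/* One signature per dataset: its variable names, sorted and joined by blanks. */
data signatures;
  length signature $ 2000;
  retain signature;
  set allvars;
  by dsname;
  if first.dsname then signature = ' ';
  signature = left(trim(signature) || ' ' || upcase(varname));
  if last.dsname then output;
  keep dsname signature;
run;

proc sort data=signatures;
  by signature dsname;
run;

/* Datasets sharing a signature get the same group number. */
data groups;
  set signatures;
  by signature;
  if first.signature then group + 1;
  unique = (first.signature and last.signature); /* 1 = no other dataset has this combination */
run;
```

In this sketch every dataset receives a group number and the UNIQUE flag marks groups of one; the signature length of 2000 is an arbitrary choice that would need to be large enough for the widest dataset.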
Example of results

Example A: Space usage

This shows detail of space used per user (owner). The user with the highest usage can be identified. The report allows drilling down to see what files are associated with a user. A similar presentation with drill-down is available per file type, per directory, and per department (implied from the user name). This provides a basis for applying a charge-back policy per department.

In drilling down to files belonging to a specific user, results similar to the following are shown. [Included in the display are: date created, date changed, and date last used.]

This shows some datasets as being unique and others as belonging to group 284. The grouping is based on the combination of variables involved. It is possible to drill down to see which datasets are unique, which other datasets are in group 284 (i.e. which other datasets use exactly the same combination of variables), or which variables are involved. Drilling down to group 284 displays the following.
It shows that the datasets named MANY and UC_DUPS are related to each other even though they are named differently.

Example B: Unused files

This shows a summary of how much space has not been used since the month indicated. Detail of which files are involved can be selected. The information was used to determine an archiving policy. The display also shows how the index to output from different PROCs can be arranged to any level of indentation desired.
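A summary of space by month of last use could be produced along the following lines. This is a sketch under assumptions: the dataset FILES, its columns FILE, SIZE (bytes) and USED (a SAS date), and the three sample rows are all invented for illustration and are not taken from the utility.

```sas
/* Hypothetical file inventory: name, size in bytes, date last used. */
data files;
  input file :$40. size used :date9.;
  format used date9.;
  datalines;
a.sas7bdat 1048576 15JAN2001
b.sas7bdat 2097152 20JAN2001
c.ssd01 524288 03MAR2002
;
run;

/* Space and file count per month in which the files were last used. */
proc sql;
  create table unused_by_month as
  select put(used, yymmn6.)  as last_used_month,
         count(*)            as files,
         sum(size) / 1024**2 as mb format=12.1
  from files
  group by calculated last_used_month
  order by last_used_month;
quit;
```

Files whose month of last use falls before a chosen cut-off would then be candidates for archiving.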
Example C: Growth

This shows a summary of the growth since the previous month. A simple calculation (assuming the same growth/decline each month) was used to determine the implied annual growth and what the disk space usage will be at selected points in the future. After the current round of deleting unnecessary files, a more appropriate growth forecast can be done. The information from this output is useful for capacity planning.

The other options under section 6 allow monitoring of the growth per file type, per user, per directory and per department, making it possible to see exactly where the space usage is growing or declining, thereby allowing the manager to know where action needs to be taken. The growth per user is shown below.
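The paper does not give the growth formula. One reading of "the same growth/decline each month" is a constant monthly percentage rate, and the sketch below uses that assumption. The usage figures and all variable names are invented for illustration.

```sas
data forecast;
  prev_gb = 1900;  /* hypothetical usage last month, in GB */
  curr_gb = 1950;  /* hypothetical usage this month, in GB */

  monthly_rate = curr_gb / prev_gb - 1;
  annual_rate  = (1 + monthly_rate)**12 - 1;  /* implied annual growth */

  /* Projected usage at selected points in the future. */
  do months_ahead = 3, 6, 12, 24;
    projected_gb = curr_gb * (1 + monthly_rate)**months_ahead;
    output;
  end;

  format monthly_rate annual_rate percent8.2 projected_gb 10.1;
run;

proc print data=forecast noobs;
run;
```

If the assumption is instead a constant absolute change per month, the projection becomes curr_gb + months_ahead * (curr_gb - prev_gb).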
Example D: Index to datasets, variables, programs etc.

The items in section 8 make it possible to easily find all occurrences (across multiple directories) of datasets or programs with particular names, and to find which datasets contain variables with a particular name or label. This is important for locating unnecessary duplication of datasets and multiple versions of programs, and for identifying which datasets to use when analysing selected variables. An extract of the duplicate datasets is shown.

Benefits achieved

- Automatically generated documentation.
- Understanding the extent of the environment being managed.
- Finding unused files.
- Finding duplicate SAS datasets.
- Finding SAS programs with the same name.
- Ability to direct users to what files can be deleted.
- Finding datasets having a particular variable.
- Finding datasets having a variable with a particular label.

Future work

Extending the analysis to a VMS environment.
The company has some data in a VMS environment. The utility will be copied to the VMS environment. The UNIX-specific system commands need to be changed to VMS equivalents.

Extending the grouping of datasets.
The grouping of datasets based on the combinations of variables they contain has been useful in identifying which datasets are possibly redundant. This will be extended to identify cases where the variables in one group are a subset of the variables in another group. Datasets in the group with the smaller set of variables could then possibly be scrapped in favour of the datasets in the group with the larger set of variables.

Extending the analysis to non-SAS files.
As a result of interest from operations management, the techniques will be applied to build up similar documentation on space usage by non-SAS files in the UNIX environment. All that needs to change is the file extensions searched for.

Appendix 1: An extract of a macro to determine sub-directories of a UNIX directory.

The macro caters for path names that contain blanks by enclosing them in quotes in the UNIX command. It creates a dataset containing the names of the sub-directories.

Example of UNIX command generated:

    ls -lR '/usr/users/name of path' | grep ./ > xxx.txt

Global variables used:
    &pgmroot   name of path from where the utility is run.
    &maxlen    maximum length of path names.

    %macro subdirs(root,out);
      %local file;
      %let file=&pgmroot/xxsubdirs.txt;

      /* list the directory tree recursively, keeping only lines that contain a path */
      data _null_;
        call system("ls -lR '&root' | grep ./ > &file");
      run;

      data &out baddir;
        infile "&file" lrecl=&maxlen pad;
        length path $ &maxlen;
        input path $ 1-&maxlen;   /* column input keeps embedded blanks */
        path=left(trim(compress(path,':')));
        if substr(path,1,1) = '/' then output &out;
        else if substr(path,1,1) = 'l' then link logical;
        else output baddir;
        return;
      logical:
        /** decide how to handle logical links. **/
        delete;
        return;
      run;
    %mend subdirs;

Sample of file created and then read by the macro above.
    /usr/users/sasuser/production:
    /usr/users/sasuser/production/checkfiles:
    /usr/users/sasuser/production/cmt_programs:
    /usr/users/sasuser/production/cmt_programs_production:
    /usr/users/sasuser/production/cmt_scripts:
    /usr/users/sasuser/production/datfiles:
    /usr/users/sasuser/production/itsv:
    /usr/users/sasuser/production/kpi:

Appendix 2: An extract from macros to read operating system details about files.

The macro %findext searches a specified UNIX directory for files with a particular extension. It creates a dataset containing the names of the files, their size, date created, date last used, date last changed, and owner of each file. It caters for path names that contain blanks by enclosing them in quotes in the UNIX command.

Detail of %readext is not shown. It reads the files created by the macro %findext (see the example of such a file below). The 1st parameter of %readext is the name of a dataset to be created. The 2nd parameter of %readext is the name of the file to be read.

Example of UNIX command generated:

    ls -lu '/usr/users/name of path'/*.SAS > xxx.txt

    %macro findext(dir,ext,dataset,type);
      %local created used changed;
      %let created=&pgmroot/xxcreated.txt;
      %let used   =&pgmroot/xxused.txt;
      %let changed=&pgmroot/xxchanged.txt;

      data _null_;
        ext=left(trim("&ext"));
        /* ls -l reports the last-modification time; it is stored here as "created" */
        call system("ls -l '&dir'/*."  || ext || " > &created");
        /* ls -lu reports the time of last access */
        call system("ls -lu '&dir'/*." || ext || " > &used");
        /* ls -lc reports the time of last status change */
        call system("ls -lc '&dir'/*." || ext || " > &changed");
      run;

      %readext(created,&created);
      %readext(used,&used);
      %readext(changed,&changed);

      data &dataset;
        length type $ 15;
        retain type "&type";   /* a description of the type of file. E.g. V6 data */
        merge created(rename=(date=created))
              used   (rename=(date=used))
              changed(rename=(date=changed));
        by file;
      run;
    %mend findext;

Sample of file created and then read by the macros above. Note that the UNIX command is inconsistent in reporting the date or the time (see the second-last line).
    1 -rwxrwxrwx 1 sasuser users 364 May 29 2000 indexip24.sas
    1 -rwxrwxrwx 1 sasuser users 364 May 29 2000 indexoo24.sas
    1 -rwxrwxrwx 1 sasuser users 321 Jun 1 2000 indexop24.sas
    1 -rw-rw-rw- 1 hoosen_i users 32 Jul 16 2001 iqudb7.sas
    1 -rwxrwxrwx 1 sasuser users 156 May 24 2000 marlene.sas
    1 -rwxrwxrwx 1 sasuser users 200 May 23 2000 mddbtest.sas
    12 -rwxrwxrwx 1 sasuser users 12274 Apr 22 13:56 mis_auto.sas
    9 -rwxrwxrw- 1 sasuser users 8798 Nov 20 2000 mis_auto_20112000.sas
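Since %readext is not shown, the following is only a sketch of how lines like those above might be read, including the field that holds either a year or a time. It is not the author's macro. It assumes file names without blanks, and assumes that a time in place of a year means the most recent occurrence of that month and day; the dataset name FILES and all variable names are hypothetical. The two data lines are taken from the sample above.

```sas
data files;
  infile datalines truncover;
  length perms $ 10 owner grp $ 16 mon $ 3 yr_or_time $ 5 file $ 200;
  input blocks perms $ links owner $ grp $ size mon $ day yr_or_time $ file $;

  /* Month number from the three-letter month name. */
  month = (index('JANFEBMARAPRMAYJUNJULAUGSEPOCTNOVDEC', upcase(mon)) + 2) / 3;

  if index(yr_or_time, ':') > 0 then do;
    /* A time is shown instead of a year: assume the current year, */
    /* or the previous year if that would put the date in the future. */
    year = year(today());
    if mdy(month, day, year) > today() then year = year - 1;
  end;
  else year = input(yr_or_time, 4.);

  date = mdy(month, day, year);
  format date date9.;
  keep file owner size date;
  datalines;
9 -rwxrwxrw- 1 sasuser users 8798 Nov 20 2000 mis_auto_20112000.sas
12 -rwxrwxrwx 1 sasuser users 12274 Apr 22 13:56 mis_auto.sas
;
run;
```

Because the year is inferred from the run date for lines that show a time, the listing would need to be read soon after it is produced.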

Appendix 3: Extending the use of ODS HTML output

These macros were written after examining the contents file at different stages of its creation by the ODS. I cannot claim to fully understand each of the HTML parameters used, although I can guess at some. The macros enable me to work with the contents page as a SAS specialist rather than as an HTML specialist. Once I find some time to learn HTML, I can extend the macros to include HTML-specific objects (e.g. drop-down lists).

    %macro htmlproclabel(text=);
      /* text that would have been generated by ODS PROCLABEL */
      put '<font color="#003399"><li><span>' "&text" '</SPAN><br></font>';
    %mend;

    %macro htmllevel(type=,href=,target=,text=,break=yes);
      /* manage indentation levels */
      %if &type=new %then %do;
        put '<dl>';
      %end;
      %if &type=new or &type= %then %do;
        put '<dt><b> </b>' '<A HREF="' "&href" '" TARGET="' "&target" '">' "&text" '</a><br>';
      %end;
      %if &type=end %then %do;
        put '</dl>';
        %if &break=yes %then %do;
          put '<br>';
        %end;
      %end;
    %mend;

Usage:

1) Create the contents file.

    %let outpath   = c:\output\destination; /* path to receive output files */
    %let framefile = abc_frame;
    %let contfile  = &framefile._contents;
    %let anchor    = xyz;

    ods html path    ="&outpath"(url=none)
             frame   ="&framefile..html"
             contents="&contfile..html" (no_bottom_matter)
             body    ="xxx.html" /* needed to satisfy ods, but does not affect contents page */
             ;

Include style=, newfile=, anchor= etc. as appropriate.

2) Stop SAS from updating the contents file.

    ods html close;
    ods html path ="&outpath"(url=none)
             body ="xxx.html" /* needed to satisfy ods, but does not affect contents page */
             ;

Include style=, newfile=, anchor= etc. as appropriate, but not frame= or contents=.

3) Create HTML output using SAS procedures. Note the names of the HTML files for use in the next step.

4) Use a DATA step to write into the contents file to achieve the desired layout.

    filename contents "&outpath\&contfile..html" mod;

    data _null_;
      file contents;
      %htmlproclabel(text=%quote(section heading, numbering is automatic));
      %htmllevel(type=new,href=&outpath.outputa.html#&anchor.1,target=body,text=indented text);
      %htmllevel(type=,href=&outpath.outputb.html#&anchor.6,target=body,text=more text);
      /* ... etc ... */
      %htmllevel(type=end);
      %htmlproclabel(text=%quote(another section, use quote if with commas));
      %htmllevel(type=new,href=&outpath.outputx.html#&anchor.4,target=body,text=indented text);
      %htmllevel(type=,href=&outpath.outputy.html#&anchor.3,target=body,text=more text);
      /* ... etc ... */
      %htmllevel(type=end);
      put '</BODY></HTML>';
    run;