An Introduction to Compressing Data Sets J. Meimei Ma, Quintiles


INTRODUCTION

This tutorial introduces compressed data sets. The SAS System compression algorithm is described along with basic syntax. The efficiency trade-offs between saving space and CPU (machine "thinking" time) are explored. Examples of the level of possible space savings are presented. The goal is to provide fundamental knowledge in order to encourage deliberate consideration of the COMPRESS= option.

The topic is limited to the COMPRESS= option for SAS data sets. No discussion is included about operating system compression utilities such as ZIP. See Bardsley (1993) for more information about other compression alternatives. Note that if you use tape storage, compression is not an option. Do not confuse the COMPRESS= option with the COMPRESS function for character variables.

The primary target audience for this topic includes programmers, data managers, statisticians, and other people who make decisions related to storing large data sets or saving space in general. Only basic knowledge of DATA step programming in the SAS System is assumed. The focus is on Release 6.07, with brief discussion of differences from Release 6.06. Essentially all hardware platforms are covered.

TERMINOLOGY

Efficiency Elements

There are two primary categories of elements to consider when evaluating efficiency: machine and human. Machine efficiency elements include

- computer processing time (CPU)
- processing time for reading or writing computer data (I/O, for Input/Output)
- storage space.

Storage space becomes a matter of efficiency when you consider the cost of storage hardware. Therefore, a technique that requires less space may be thought of as more cost efficient. Efficiency for humans is related to programmer time, level of expertise required, or clarity of final logic.
Major components of programmer time include planning programming strategy, designing database structures, writing or revising code, testing programs, running production programs, and writing documentation. Depending on the type of research involved, database design and storage space planning may also be major programming activities.

Always consider both machine and human elements when choosing between options for efficiency reasons. The choice may be difficult, or at least ambiguous, since the more machine-efficient option can require additional human effort or vice versa. In the case of compression, a conflict may also occur between the machine elements of CPU and I/O.

Large File Environment

In large file environments, choosing an efficient programming strategy tends to be important. A large file can be defined as a file for which processing all records is a significant event. This may apply to files that are short and wide, with relatively few observations but a large number of variables, or long and thin, with only a few variables for many observations. The exact size that qualifies as large depends on the computing environment. In a mainframe environment a file may need to have ten thousand records before being considered large. For a microcomputer, even one thousand records may seem to take a long time to process. Batch processing is used more frequently when dealing with large files.

Compression Terms

The terms associated with the COMPRESS= option include

- uncompressed or non-compressed, also called fixed length
- compressed, also called variable length
- decompress.

For example: records in a compressed data set are decompressed automatically before use by a SAS procedure.

COMPRESSION ALGORITHM

The compression algorithm has evolved since it was first introduced. It is a straightforward algorithm designed for general use. Although more efficient compression algorithms could be applied, an advantage is that essentially no differences exist across different hardware platforms.

In order to compress a data set, each observation (record) of a SAS data set is evaluated separately. The record is treated as a sequence of bytes. Distinctions between variables, both type and boundaries, are ignored. The algorithm compresses identical consecutive bytes into two or three bytes. Repeated blanks (3 to 129) or binary zeros (3 to 66) are compressed into two bytes. For other types of repeated values (3 to 63), the compressed result takes three bytes.

Once you understand the way the algorithm works, predicting the potential level of savings becomes simpler. For character variables, if many blanks (embedded) or missing values (also blanks) occur, then compression could potentially save a great deal of space. On the other hand, character variables that contain few repeated blanks will compress much less. For numeric variables, savings are most likely for integer values stored in the default LENGTH of 8 bytes. For example, if ANSWER1=1 then seven of the bytes are zeros, which would reduce to two bytes. If ANSWER1=0 and ANSWER2=0 and they are stored consecutively, then those 16 bytes would shrink to two. For real numbers, essentially those with decimal places, the likelihood of repeating bytes is smaller. Note that no precision is lost when an observation is decompressed for use.

Release 6.07 includes a change that substantially improves the time required for decompression. According to tests reported by Beatrous et al. (1992), the improvement can be as much as 60%. The decompression time decreases 41% even for an "average" data set.
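The byte-run rules described in this section can be made concrete with a small estimate. The DATA step below is only a sketch under stated assumptions, not the actual SAS implementation: it assumes runs shorter than three bytes are stored as-is, it ignores any per-record overhead, and the variable names (REC, OUTLEN, MAXRUN, and so on) are hypothetical.

```sas
/* Sketch only: estimates the compressed size of one character value     */
/* using the run lengths quoted in the text. Assumes runs of fewer than  */
/* 3 identical bytes are stored unchanged; per-record overhead ignored.  */
DATA _NULL_;
   LENGTH rec $ 30 ch $ 1;
   rec    = 'SMITH';            /* padded with 25 trailing blanks        */
   reclen = 30;                 /* declared length of REC                */
   outlen = 0;                  /* running estimate of compressed bytes  */
   i = 1;
   DO WHILE (i <= reclen);
      ch = SUBSTR(rec, i, 1);
      /* longest run allowed for this byte value, per the text */
      IF ch = ' ' THEN maxrun = 129;
      ELSE IF ch = '00'x THEN maxrun = 66;
      ELSE maxrun = 63;
      /* count identical consecutive bytes starting at position i */
      n = 1;
      same = 1;
      DO WHILE (same AND i + n <= reclen AND n < maxrun);
         IF SUBSTR(rec, i + n, 1) = ch THEN n = n + 1;
         ELSE same = 0;
      END;
      /* blanks and binary zeros cost 2 bytes, other runs cost 3 */
      IF n >= 3 THEN DO;
         IF ch = ' ' OR ch = '00'x THEN outlen = outlen + 2;
         ELSE outlen = outlen + 3;
      END;
      ELSE outlen = outlen + n;
      i = i + n;
   END;
   PUT 'Original bytes: ' reclen ' Estimated compressed bytes: ' outlen;
RUN;
```

For this value the estimate is 7 bytes: five literal bytes for SMITH plus two bytes for the run of 25 blanks. Changing REC to a value with few repeated bytes shows how little such data would shrink under the same assumptions.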
The 6.07 process differs from 6.06 in that every field, whether compressed or uncompressed, is prefixed with an indication of field length. The 6.06 method for storing compressed fields requires that all fields in a compressed record be searched for an escape code to decide whether or not decompression is needed. You can create the 6.06 compressed format using Release 6.07 by using the ENGINE= or FILEFMT= option. You can read or modify 6.06 compressed data sets using 6.07 directly.

COMPRESSION USAGE

Syntax

There are two options associated with compressed data sets:

   COMPRESS= YES | NO
   REUSE= YES | NO

Both can be used as a system option or a data set option. Specifying a data set option overrides the system option setting. The COMPRESS= option applies only to output data sets. You cannot change an existing uncompressed SAS data set into a compressed data set without creating a new data set.

After a system option statement

   OPTIONS COMPRESS=YES;

is executed, all data sets created afterward will be compressed. This applies to permanent or temporary data sets. For a program that includes a large number of WORK data sets, the extra CPU required could be substantial.

Alternatively, you can compress specific data sets. For example:

   DATA libname.filename (COMPRESS=YES);
      SET inname;
      /* Data Creation Statements */
   RUN;

If you are planning to insert, delete, or update observations using a compressed data set and space usage is critical, then the REUSE= option may help. Free space is tracked and reused when you specify REUSE=YES when creating a compressed data set.

Using Compressed Data Sets

Since decompression is automatic, you can almost forget whether a data set is compressed or uncompressed. This applies whether you are using a SET statement in a DATA step or a SAS procedure. To keep track, output from the CONTENTS procedure includes an indication of compression status.
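The options above can be combined in one short program. The following is a minimal sketch rather than a recommended setup: the data set names (WORK.VISITS, WORK.SCRATCH, WORK.VISITS2) and variables are hypothetical and exist only to make the example self-contained.

```sas
/* Hypothetical test data: mostly blank character values */
DATA work.visits;
   LENGTH comment $ 40;
   DO patient = 1 TO 100;
      comment = ' ';
      OUTPUT;
   END;
RUN;

/* System option: data sets created from here on are compressed */
OPTIONS COMPRESS=YES;

/* A data set option overrides the system setting for one data set */
DATA work.scratch (COMPRESS=NO);
   SET work.visits;
RUN;

/* Compress and track free space for later inserts or deletes */
DATA work.visits2 (COMPRESS=YES REUSE=YES);
   SET work.visits;
RUN;

/* CONTENTS output indicates whether the data set is compressed */
PROC CONTENTS DATA=work.visits2;
RUN;
```

Setting COMPRESS=NO on scratch data sets while the system option is on is one way to limit the extra CPU in programs that create many WORK data sets.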
Using the POSITION option to review the order of variables is helpful if you want to rearrange variables to maximize repeated bytes for optimal compression of large permanent data sets.

Release 6.07 includes two features related to compressed data sets not in 6.06. The SAS Log contains NOTEs about the amount of savings for a data set, as shown in Figure 1. With the reduction in size of 6.07 uncompressed data sets compared to 6.06 (Beatrous et al.), the percentage savings for 6.07 compressed data sets may not be as large.

Figure 1. SAS Log Notes (VMS)

Program code:

   DATA file.newdata (COMPRESS=YES);
      SET file.olddata;
      /* DATA CREATION STATEMENTS */
   RUN;

Compress note:

   NOTE: The data set FILE.NEWDATA has 1000 observations and 200 variables.
   NOTE: Compressing data set FILE.NEWDATA decreased size by 50.00 percent.
         Compressed is 18 pages; un-compressed would require 36 pages.

When the COPY procedure is used, the copy has the same attributes as the original. Thus a copy of a compressed data set is also compressed. In 6.06, whether or not the copy was compressed depended on the system default setting.

DECISION FACTORS

Disadvantages

The primary disadvantage of using compressed data sets is the extra CPU required for decompression. Muller et al. (1992) compared CPU usage for compressed and uncompressed versions of a data set under MVS. They found that using the SORT, FREQ, MEANS, or CORR procedures on the compressed data set could take twice as much CPU. Reading in the compressed observations using SET took almost four times as much CPU.

Since compressed observations are variable length, instead of fixed length, certain standard access methods are not as effective. No direct access using POINT= is available, although a workaround is possible: you could create and index a variable with the value of _N_. In general, using FIRST. and LAST. processing will be slower.

The additional complexity associated with using compressed data sets is a factor to remember. With variable length records, predicting space needs becomes more difficult, although estimating an upper limit is possible. Since a 12-byte overhead per compressed record is required, a 6.07 compressed data set could be larger than its "lean" counterpart, which only has a 1-bit overhead.
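The workaround for the missing POINT= access can be sketched as follows. This is an illustration under assumptions, not a tested production pattern: the data set names and the variable OBSNUM are hypothetical, and whether the index is actually used for the WHERE subset is decided by SAS at run time.

```sas
/* Hypothetical input data */
DATA work.source;
   LENGTH note $ 60;
   DO id = 1 TO 1000;
      note = ' ';
      OUTPUT;
   END;
RUN;

/* Store the observation number and index it while compressing */
DATA work.big (COMPRESS=YES INDEX=(obsnum));
   SET work.source;
   obsnum = _N_;
RUN;

/* Retrieve one observation by its stored number instead of POINT= */
DATA work.pick;
   SET work.big (WHERE=(obsnum = 500));
RUN;
```

The stored OBSNUM reflects the order at creation time only, so it would need to be rebuilt if observations are later deleted or the data set is re-sorted.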
If CPU usage is also a concern, then a balancing act is required between using compressed and uncompressed data sets. One approach may be to compress only permanent data sets. For end-user applications, compression may cause problems if not tested thoroughly.

Advantages

There are other advantages to compression besides the obvious one of saving storage space. Note that specifying COMPRESS=YES adds no extra human effort when using a compressed data set, since decompression is automatic. In addition, no loss of precision occurs, which guarantees that results are identical to those based on an uncompressed version of the data. Compared to other efficiency techniques, making practical use of compressed data sets requires a relatively small amount of programmer effort, either for decision making or execution.

Compression can enhance other efficiency strategies such as

- minimizing storage length for variables
- dropping unneeded variables
- using indexes when appropriate.

You can minimize the space required for specific variables by using LENGTH<8 or storing categorical values as character. Note that using a shorter LENGTH can result in precision problems, especially when data must be transferred across hardware platforms. Storing categorical variables such as SEX as character may be better than just decreasing the LENGTH. Always consider using DROP/KEEP to ensure that only required variables remain in temporary or permanent data sets.

When you use an index to access observations in a compressed data set, only the required observations are decompressed. So by indexing a compressed data set, the amount of extra CPU required for decompression may decrease substantially. Combining any or all of these techniques with compression can efficiently save the most storage space.
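A minimal sketch of combining these techniques in one step is shown below. All data set and variable names are hypothetical, and the numeric LENGTH of 3 is only an assumption that suits small integers; the minimum allowed length differs by platform.

```sas
/* Hypothetical raw data with one variable that is not needed later */
DATA work.raw;
   DO patient = 1 TO 500;
      visit   = MOD(patient, 4) + 1;
      sexcode = MOD(patient, 2) + 1;          /* 1 or 2 */
      weight  = 60 + MOD(patient, 30) + 0.5;
      scratch = patient * 2;                  /* unneeded */
      OUTPUT;
   END;
RUN;

/* Shorter numeric length, character category, KEEP=, index, compression */
DATA work.slim (COMPRESS=YES INDEX=(patient)
                KEEP=patient visit sex weight);
   LENGTH visit 3 sex $ 1;
   SET work.raw;
   IF sexcode = 1 THEN sex = 'M';
   ELSE sex = 'F';
RUN;

/* A WHERE subset can use the index so that only the required */
/* observations need to be decompressed                       */
PROC PRINT DATA=work.slim;
   WHERE patient = 42;
RUN;
```

WEIGHT keeps the default length of 8 here because shortening a variable that holds decimal values risks the precision problems mentioned above.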

EXAMPLES

Example 1

As shown in Figure 2, Muller et al. confirmed that wide data sets benefit more from compression. They created test data sets under MVS. Short meant 10,000 observations compared to 100,000 in the long data sets. Thin data sets included 20 variables while wide ones contained 110 variables. Regardless of the number of observations, about 45% savings in space occurred for the wide data sets, which was more than twice as much as the 20% savings for thin data sets.

Figure 2. Thin Versus Wide Space Savings (percent improvement for short and long data sets, thin versus wide)

Example 2

Bardsley demonstrated the effect of the compression algorithm by creating data sets emphasizing specific characteristics. The results under AIX were comparable to MVS. Wide data sets contained 120 variables. The results by type of value are:

   Type                    Percent Savings
   Numeric Real
   Single-digit Integer
   Numeric Missing
   Character Missing

As expected, data sets with a greater occurrence of repeated bytes benefit the most.

Example 3

The table in Figure 3 shows space savings for actual clinical trial data sets under MVS. The data sets are shown sorted by observation length. In ADVERSE and MEDS, the character variables include a large number of blanks. ADVERSE is only half character variables, but they account for 83% of the storage space in fixed length records. Most of the values in EFFBASE are zeros or ones. In general, the wider data sets compressed more effectively, with the exception of LABS. However, this is not a surprise since lab data consists mostly of real numbers.

Figure 3. Space Savings in Clinical Trials

   Data set      Number of    Record    % Character    % Character    % Space
                 Variables    Length    Vars           Length         Saved
   1) VITALS
   2) EFFSUMS
   3) ADVERSE
   4) MEDS
   5) EFFBASE
   6) LABS

CONCLUSION

Generally, the trade-offs are clear when evaluating whether or not to compress data sets. You must weigh the importance of saving space against using the least amount of CPU, assuming both save money. The characteristics of obvious compression situations include any or all of the following:

- not enough space available
- wide observations with repeated values
- "blank" character variables expected
- infrequent processing of large data sets
- subset processing of large indexed data sets
- >50% savings demonstrated.

In any Release 6.07 computing environment, you can benefit from using the COMPRESS= option as long as you compress data sets deliberately.

RECOMMENDED READING

Compression

Bardsley, P. (1993), "Space-saving Tools: Compression in SAS 6.07," Proc. of the First Annual Southeast SAS Users Group Conference, Cary, NC: SAS Institute Inc.

Beatrous, S. and Stokes, J.T. (1992), "I/O Performance Improvements in Release 6.07 of the SAS System under MVS, CMS, and VMS," Proc. of the Seventeenth Annual SAS Group Intl. Conference, Cary, NC: SAS Institute Inc.

Clifford, W., Beatrous, S., Stokes, J.T., and Mosmon, K. (1989), "Using New SAS Database Features and Options," Proc. of the Fourteenth Annual SAS Group Intl. Conference, Cary, NC: SAS Institute Inc.

Muller, S.M., Hardy, K.A., and Brown, K.J. (1992), "Getting It And Keeping It With SAS Software," Proc. of the Seventeenth Annual SAS Group Intl. Conference, Cary, NC: SAS Institute Inc.

SAS Institute Inc. (1990), SAS Companion for the VMS Environment, Version 6, First Edition, Cary, NC: SAS Institute Inc. (Compression related topics on pp. 137, 141, 143, 345, 408.)

SAS Institute Inc. (1990), SAS Language: Reference, Version 6, First Edition, Cary, NC: SAS Institute Inc. (COMPRESS= option; REUSE= option.)

Efficiency Techniques

Howard, N.
(1991), "Efficiency Techniques for Improving I/O and Processing Time in the DATA Step," Proc. of the Fourteenth Annual SAS Group Intl. Conference, Cary, NC: SAS Institute Inc.

Ma, J.M. (1991), "Efficiency Revisited: Large Files and Release 6.06," Proc. of the Sixteenth Annual SAS Group Intl. Conference, Cary, NC: SAS Institute Inc.

Mackiernan, Y.D. (1989), "Don't Do Anything You Don't Have To: Elementary Strategies For Processing Large Data Sets With SAS Software," Proc. of the Fourteenth Annual SAS Group Intl. Conference, Cary, NC: SAS Institute Inc.

Muller, K.E., Smith, J., and Bass, J. (1982), "Managing 'not small' Datasets in a Research Environment," Proc. of the Seventh Annual SAS Users Group Intl. Conference, Cary, NC: SAS Institute Inc.

SAS Institute Inc. (1990), SAS Programming Tips: A Guide to Efficient SAS Processing, Cary, NC: SAS Institute Inc.

Smith, u. (1991), "Efficient Use of Numeric and Character Data Types," Proc. of the Sixteenth Annual SAS Group Intl. Conference, Cary, NC: SAS Institute Inc.

CONTACT ADDRESS

Dr. J. Meimei Ma
Quintiles
P. O. Box
Research Triangle Park, NC

SAS is a registered trademark of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.


Data Set Options. Specify a data set option in parentheses after a SAS data set name. To specify several data set options, separate them with spaces. 23 CHAPTER 4 Data Set Options Definition 23 Syntax 23 Using Data Set Options 24 Using Data Set Options with Input or Output SAS Data Sets 24 How Data Set Options Interact with System Options 24 Data Set

More information

An Introduc+on to Computers and Java CSC 121 Spring 2017 Howard Rosenthal

An Introduc+on to Computers and Java CSC 121 Spring 2017 Howard Rosenthal An Introduc+on to Computers and Java CSC 121 Spring 2017 Howard Rosenthal Lesson Goals Learn the basic terminology of a computer system Understand the basics of high level languages, including Java Understand

More information

Using SAS/SHARE More Efficiently

Using SAS/SHARE More Efficiently Using More Efficiently by Philip R Holland, Holland Numerics Ltd, UK Abstract is a very powerful product which allow concurrent access to SAS Datasets for reading and updating. However, if not used with

More information

WORKSHOP: Using the Health Survey for England, 2014

WORKSHOP: Using the Health Survey for England, 2014 WORKSHOP: Using the Health Survey for England, 2014 There are three sections to this workshop, each with a separate worksheet. The worksheets are designed to be accessible to those who have no prior experience

More information
