An Introduction to Compressing Data Sets J. Meimei Ma, Quintiles

INTRODUCTION

This tutorial introduces compressed data sets. The SAS system compression algorithm is described along with basic syntax. The efficiency trade-offs between saving space and CPU (machine "thinking" time) are explored. Examples of the level of possible space savings are presented. The goal is to provide fundamental knowledge in order to encourage deliberate consideration of the COMPRESS= option.

The topic is limited to the COMPRESS= option for SAS data sets. No discussion is included about operating system compression utilities such as ZIP. See Bardsley (1993) for more information about other compression alternatives. Note that if you use tape storage, compression is not an option. Do not confuse the COMPRESS= option with the COMPRESS function for character variables.

The primary target audience for this topic includes programmers, data managers, statisticians, and other people who make decisions related to storing large data sets or saving space in general. Only basic knowledge of DATA step programming in the SAS system is assumed. The focus is on Release 6.07, with brief discussion of differences from earlier releases. Essentially all hardware platforms are covered.

TERMINOLOGY

Efficiency Elements

There are two primary categories of elements to consider when evaluating efficiency: machine and human. Machine efficiency elements include:

- computer processing time (CPU)
- processing time for reading or writing computer data (I/O for Input/Output)
- storage space.

Storage space becomes a matter of efficiency when you consider the cost of storage hardware. Therefore, a technique that requires less space may be thought of as more cost efficient.

Efficiency for humans is related to programmer time, level of expertise required, or clarity of final logic.
Major components of programmer time include planning programming strategy, designing database structures, writing or revising code, testing programs, running production programs, and writing documentation. Depending on the type of research involved, database design and storage space planning may also be major programming activities.

Always consider both machine and human elements when choosing between options for efficiency reasons. The choice may be difficult, or at least ambiguous, since the more machine-efficient option can require additional human effort or vice versa. In the case of compression, a conflict may also occur between the machine elements of CPU and I/O.

Large File Environment

In large file environments, choosing an efficient programming strategy tends to be important. A large file can be defined as a file for which processing all records is a significant event. This may apply to files that are short and wide, with relatively few observations but a large number of variables, or long and thin, with only a few variables for many observations.

The exact size that qualifies as large depends on the computing environment. In a mainframe environment a file may need to have ten thousand records before being considered large. For a microcomputer, even one thousand records may seem to take a long time to process. Batch processing is used more frequently when dealing with large files.

Compression Terms

The terms associated with the COMPRESS= option include:

- uncompressed or non-compressed, also called fixed length
- compressed, also called variable length
- decompress.

For example, records in a compressed data set are decompressed automatically before use by a SAS procedure.

COMPRESSION ALGORITHM

The compression algorithm has evolved since it was first introduced. It is a straightforward algorithm designed for general use. Although more efficient compression algorithms could be applied, an advantage of this one is that essentially no differences exist across hardware platforms.

In order to compress a data set, each observation (record) of a SAS data set is evaluated separately. The record is treated as a sequence of bytes. Distinctions between variables, both type and boundaries, are ignored. The algorithm compresses identical consecutive bytes into two or three bytes. Repeated blanks (3 to 129) or binary zeros (3 to 66) are compressed into two bytes. For other types of repeated values (3 to 63), the compressed result takes three bytes.

Once you understand the way the algorithm works, predicting the potential level of savings becomes simpler. For character variables, if many blanks (embedded) or missing values (also blanks) occur, then compression could potentially save a great deal of space. On the other hand, character variables that contain few repeated blanks will compress much less. For numeric variables, savings are most likely for integer values stored in the default LENGTH of 8 bytes. For example, if ANSWER1=1 then seven of the bytes are zeros, which would reduce to two bytes. If ANSWER1=0 and ANSWER2=0 and they are stored consecutively, then those 16 bytes would shrink to two. For real numbers, essentially those with decimal places, the likelihood of repeating bytes is smaller. Note that no precision is lost when an observation is decompressed for use.

Release 6.07 includes a change that substantially improves the time required for decompression. According to tests reported by Beatrous et al. (1992), the improvement can be as much as 60%. The decompression time decreases 41% even for an "average" data set.
The 6.07 process differs in that every field, whether compressed or uncompressed, is prefixed with an indication of field length. The 6.06 method for storing compressed fields requires that all fields in a compressed record be searched for an escape code to decide whether or not decompression is needed. You can create the 6.06 compressed format using Release 6.07 by using the ENGINE= or FILEFMT= option. You can read or modify 6.06 compressed data sets using 6.07 directly.

COMPRESSION USAGE

Syntax

There are two options associated with compressed data sets:

    COMPRESS = YES | NO
    REUSE = YES | NO

Both can be used as a system option or a data set option. Specifying a data set option overrides the system option setting. The COMPRESS= option applies only to output data sets. You cannot change an existing uncompressed SAS data set into a compressed data set without creating a new data set.

After a system option statement

    OPTIONS COMPRESS=YES;

is executed, all created data sets will be compressed. This applies to permanent or temporary data sets. For a program that includes a large number of WORK data sets, the extra CPU required could be substantial.

Alternatively, you can compress specific data sets. For example:

    DATA libname.filename (COMPRESS=YES);
       SET inname;
       /* Data Creation Statements */

If you are planning to insert, delete, or update observations using a compressed data set and space usage is critical, then the REUSE= option may help. Free space is tracked and reused when you specify REUSE=YES when creating a compressed data set.

Using Compressed Data Sets

Since decompression is automatic, you can almost forget whether a data set is compressed or uncompressed. This applies whether you are using a SET statement in a DATA step or a SAS procedure. To keep track, output from the CONTENTS procedure includes an indication of compression status.
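As a minimal sketch of the syntax above, the following program compresses one output data set and then checks its status with the CONTENTS procedure. The libref STUDY, its path, and the data set names are hypothetical placeholders, not names taken from this paper.

```sas
/* Hypothetical libref and path: replace with your own library. */
libname study 'your-library-path';

/* Compress only this output data set. REUSE=YES asks SAS to    */
/* track and reuse free space left by deleted observations.     */
data study.newdata (compress=yes reuse=yes);
   set study.olddata;          /* hypothetical input data set */
run;

/* The CONTENTS output indicates whether the data set is compressed. */
proc contents data=study.newdata;
run;
```

Because a data set option overrides the system option, writing COMPRESS=NO on a single DATA statement should leave that one data set uncompressed even after OPTIONS COMPRESS=YES; has been issued.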
Using the POSITION option to review the order of variables is helpful if you want to rearrange variables to maximize repeated bytes for optimal compression of large permanent data sets.

Release 6.07 includes two features related to compressed data sets that were not available in earlier releases. The SAS Log contains NOTEs about the amount of savings for a data set, as shown in Figure 1.

Figure 1. SAS Log Notes (VMS)

Program code:

    DATA file.newdata (COMPRESS=YES);
       SET file.olddata;
       /* DATA CREATION STATEMENTS */
    RUN;

Compress note:

    NOTE: The data set FILE.NEWDATA has 1000 observations and 200 variables.
    NOTE: Compressing data set FILE.COMPDATA decreased size by percent. Compressed is 18 pages; un-compressed would require 36 pages.

With the reduction in size of 6.07 uncompressed data sets compared to 6.06 (Beatrous et al.), the percentage savings for 6.07 compressed data sets may not be as large.

When the COPY procedure is used, the copy has the same attributes as the original. Thus a copy of a compressed data set is also compressed. In 6.06, whether or not the copy was compressed depended on the system default setting.

DECISION FACTORS

Disadvantages

The primary disadvantage of using compressed data sets is the extra CPU required for decompression. Muller et al. (1992) compared CPU usage for compressed and uncompressed versions of a data set under MVS. They found that using the SORT, FREQ, MEANS, or CORR procedures on the compressed data set could take twice as much CPU. Reading in the compressed observations using SET took almost four times as much CPU.

Since compressed observations are variable length, instead of fixed length, certain standard access methods are not as effective. No direct access using POINT= is available, although a workaround is possible: you could create and index a variable with the value of _N_. In general, using FIRST. and LAST. processing will be slower.

The additional complexity associated with using compressed data sets is a factor to remember. With variable length records, predicting space needs becomes more difficult, although estimating an upper limit is possible. Since a 12-byte overhead per compressed record is required, a 6.07 compressed data set could be larger than its "lean" counterpart, which only has a 1-bit overhead.
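The workaround for the missing POINT= access mentioned above might look like the following sketch. The libref STUDY, the data set names, and the variable OBSNUM are hypothetical. The WHERE statement is one way to let SAS use the index; whether it actually does so is assumed here and depends on the release and on the subset requested.

```sas
/* Hypothetical libref and path: replace with your own library. */
libname study 'your-library-path';

/* Store the observation number in a variable and index it.     */
/* RAW, BIG, and OBSNUM are hypothetical names.                 */
data study.big (compress=yes index=(obsnum));
   set study.raw;
   obsnum = _n_;               /* observation number at creation time */
run;

/* Select specific observations by value instead of using POINT=. */
data work.picked;
   set study.big;
   where obsnum in (5, 500, 5000);
run;
```

Note that OBSNUM records the position at the time the data set was created, so it would no longer match the physical order if observations are later inserted or deleted.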
If CPU usage is also a concern, then a balancing act is required between using compressed and uncompressed data sets. One approach may be to compress only permanent data sets. For end-user applications, compression may cause problems if not tested thoroughly.

Advantages

There are other advantages to compression besides the obvious one of saving storage space. Note that specifying COMPRESS=YES adds no extra human effort when using a compressed data set since decompression is automatic. In addition, no loss of precision occurs, which guarantees that results are identical to those based on an uncompressed version of the data.

Compared to other efficiency techniques, making practical use of compressed data sets requires a relatively small amount of programmer effort, either for decision making or execution.

Compression can enhance other efficiency strategies such as:

- minimizing storage length for variables
- dropping unneeded variables
- using indexes when appropriate.

You can minimize the space required for specific variables by using LENGTH<8 or storing categorical values as character. Note that using a shorter LENGTH can result in precision problems, especially when data must be transferred across hardware platforms. Storing categorical variables such as SEX as character may be better than just decreasing the LENGTH. Always consider using DROP/KEEP to assure that only required variables remain in temporary or permanent data sets.

When you use an index to access observations in a compressed data set, only the required observations are decompressed. So by indexing a compressed data set, the amount of extra CPU required for decompression may decrease substantially. Combining any or all of these techniques with compression can efficiently save the most storage space.
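The strategies above can be combined in a single step. The following is only a sketch under stated assumptions: the libref STUDY, the data sets RAWVISITS and VISITS, and the variables PATID, VISIT, SEX, and ANSWER1 are hypothetical, PATID is assumed to be numeric, and ANSWER1 is assumed to hold small integers so that a shorter LENGTH does not lose precision.

```sas
/* Hypothetical libref and path: replace with your own library. */
libname study 'your-library-path';

/* Keep only the needed variables, shorten one integer variable, */
/* build an index on the key, and compress the result.           */
data study.visits (compress=yes
                   keep=patid visit sex answer1
                   index=(patid));
   length answer1 4;     /* assumes small integer values only */
   set study.rawvisits;  /* hypothetical input data set */
run;

/* If SAS uses the PATID index for this WHERE, only the selected */
/* observations need to be decompressed.                         */
proc print data=study.visits;
   where patid = 1001;
run;
```

The shorter LENGTH is the riskiest of these choices, particularly if the data will be moved across hardware platforms, so it may be reasonable to start with KEEP=, INDEX=, and COMPRESS= alone.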

EXAMPLES

Example 1

As shown in Figure 2, Muller et al. confirmed that wide data sets benefit more from compression. They created test data sets under MVS. Short meant 10,000 observations compared to 100,000 in the long data sets. Thin data sets included 20 variables while wide ones contained 110 variables. Regardless of the number of observations, about 45% savings in space occurred for the wide data sets, which was more than twice as much as the 20% savings for thin data sets.

[Figure 2. Thin Versus Wide Space Savings: percent improvement for short and long data sets, thin versus wide]

Example 2

Bardsley demonstrated the effect of the compression algorithm by creating data sets emphasizing specific characteristics. The results under AIX were comparable to MVS. Wide data sets contained 120 variables. Results were reported as percent savings by type of value: numeric real, single-digit integer, numeric missing, and character missing. As expected, data sets with a greater occurrence of repeated bytes benefit the most.

Example 3

The table in Figure 3 shows space savings for actual clinical trial data sets under MVS. The data sets are shown sorted by observation length. In ADVERSE and MEDS, the character variables include a large number of blanks. ADVERSE is only half character variables, but they account for 83% of the storage space in fixed length records. Most of the values in EFFBASE are zeros or ones. In general, the wider data sets compressed more effectively, with the exception of LABS. However, this is not a surprise since lab data consists mostly of real numbers.

[Figure 3. Space Savings in Clinical Trials. Columns: Data set, Number of Variables, Record Length, % Character Vars, % Character Length, % Space Saved. Rows: VITALS, EFFSUMS, ADVERSE, MEDS, EFFBASE, LABS]

CONCLUSION

Generally, the trade-offs are clear when evaluating whether or not to compress data sets. You must choose between the importance of saving space versus using the least amount of CPU, assuming both save money. The characteristics of obvious compression situations include any or all of the following:

- not enough space available
- wide observations with repeated values
- "blank" character variables expected
- infrequent processing of large data sets
- subset processing of large indexed data sets
- >50% savings demonstrated.

In any Release 6.07 computing environment, you can benefit from using the COMPRESS= option as long as you compress data sets deliberately.

RECOMMENDED READING

Compression

Bardsley, P. (1993), "Space-saving Tools: Compression in SAS 6.07," Proc. of the First Annual Southeast SAS Users Group Conference, Cary, NC: SAS Institute Inc.

Beatrous, S. and Stokes, J.T. (1992), "I/O Performance Improvements in Release 6.07 of the SAS System under MVS, CMS, and VMS," Proc. of the Seventeenth Annual SAS Group Intl. Conference, Cary, NC: SAS Institute Inc.

Clifford, W., Beatrous, S., Stokes, J.T., and Mosmon, K. (1989), "Using New SAS Database Features and Options," Proc. of the Fourteenth Annual SAS Group Intl. Conference, Cary, NC: SAS Institute Inc.

Muller, S.M., Hardy, K.A., and Brown, K.J. (1992), "Getting It And Keeping It With SAS Software," Proc. of the Seventeenth Annual SAS Group Intl. Conference, Cary, NC: SAS Institute Inc.

SAS Institute Inc. (1990), SAS Companion for the VMS Environment, Version 6, First Edition, Cary, NC: SAS Institute Inc. (Compression related topics on pp. 137, 141, 143, 345, 408.)

SAS Institute Inc. (1990), SAS Language: Reference, Version 6, First Edition, Cary, NC: SAS Institute Inc. (See the COMPRESS= and REUSE= options.)

Efficiency Techniques

Howard, N. (1991), "Efficiency Techniques for Improving I/O and Processing Time in the DATA Step," Proc. of the Fourteenth Annual SAS Group Intl. Conference, Cary, NC: SAS Institute Inc.

Ma, J.M. (1991), "Efficiency Revisited: Large Files and Release 6.06," Proc. of the Sixteenth Annual SAS Group Intl. Conference, Cary, NC: SAS Institute Inc.

Mackiernan, Y.D. (1989), "Don't Do Anything You Don't Have To: Elementary Strategies For Processing Large Data Sets With SAS Software," Proc. of the Fourteenth Annual SAS Group Intl. Conference, Cary, NC: SAS Institute Inc.

Muller, K.E., Smith, J., and Bass, J. (1982), "Managing 'not small' Datasets in a Research Environment," Proc. of the Seventh Annual SAS Users Group Intl. Conference, Cary, NC: SAS Institute Inc.

SAS Institute Inc. (1990), SAS Programming Tips: A Guide to Efficient SAS Processing, Cary, NC: SAS Institute Inc.

Smith, u. (1991), "Efficient Use of Numeric and Character Data Types," Proc. of the Sixteenth Annual SAS Group Intl. Conference, Cary, NC: SAS Institute Inc.

CONTACT ADDRESS

Dr. J. Meimei Ma
Quintiles
P. O. Box
Research Triangle Park, NC

SAS is a registered trademark of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.
More informationManaging your metadata efficiently - a structured way to organise and frontload your analysis and submission data
Paper TS06 Managing your metadata efficiently - a structured way to organise and frontload your analysis and submission data Kirsten Walther Langendorf, Novo Nordisk A/S, Copenhagen, Denmark Mikkel Traun,
More informationHandling Numeric Representation SAS Errors Caused by Simple Floating-Point Arithmetic Computation Fuad J. Foty, U.S. Census Bureau, Washington, DC
Paper BB-206 Handling Numeric Representation SAS Errors Caused by Simple Floating-Point Arithmetic Computation Fuad J. Foty, U.S. Census Bureau, Washington, DC ABSTRACT Every SAS programmer knows that
More informationMastering the Basics: Preventing Problems by Understanding How SAS Works. Imelda C. Go, South Carolina Department of Education, Columbia, SC
SESUG 2012 ABSTRACT Paper PO 06 Mastering the Basics: Preventing Problems by Understanding How SAS Works Imelda C. Go, South Carolina Department of Education, Columbia, SC There are times when SAS programmers
More informationIntroduction to SAS. Cristina Murray-Krezan Research Assistant Professor of Internal Medicine Biostatistician, CTSC
Introduction to SAS Cristina Murray-Krezan Research Assistant Professor of Internal Medicine Biostatistician, CTSC cmurray-krezan@salud.unm.edu 20 August 2018 What is SAS? Statistical Analysis System,
More informationEmpowering the SAS Programmer: Understanding Basic Microsoft Windows Performance Metrics by Customizing the Data Results in SAS/GRAPH Software
Paper SAS406-2014 Empowering the SAS Programmer: Understanding Basic Microsoft Windows Performance Metrics by Customizing the Data Results in SAS/GRAPH Software John Maxwell, SAS Institute Inc. ABSTRACT
More informationIBM 370 Basic Data Types
IBM 370 Basic Data Types This lecture discusses the basic data types used on the IBM 370, 1. Two s complement binary numbers 2. EBCDIC (Extended Binary Coded Decimal Interchange Code) 3. Zoned Decimal
More informationOne-PROC-Away: The Essence of an Analysis Database Russell W. Helms, Ph.D. Rho, Inc.
One-PROC-Away: The Essence of an Analysis Database Russell W. Helms, Ph.D. Rho, Inc. Chapel Hill, NC RHelms@RhoWorld.com www.rhoworld.com Presented to ASA/JSM: San Francisco, August 2003 One-PROC-Away
More informationUSING PROC SQL EFFECTIVELY WITH SAS DATA SETS JIM DEFOOR LOCKHEED FORT WORTH COMPANY
USING PROC SQL EFFECTIVELY WITH SAS DATA SETS JIM DEFOOR LOCKHEED FORT WORTH COMPANY INTRODUCTION This paper is a beginning tutorial on reading and reporting Indexed SAS Data Sets with PROC SQL. Its examples
More informationSAS/ASSIST Software Setup
173 APPENDIX 3 SAS/ASSIST Software Setup Appendix Overview 173 Setting Up Graphics Devices 173 Setting Up Remote Connect Configurations 175 Adding a SAS/ASSIST Button to Your Toolbox 176 Setting Up HTML
More informationA Breeze through SAS options to Enter a Zero-filled row Kajal Tahiliani, ICON Clinical Research, Warrington, PA
ABSTRACT: A Breeze through SAS options to Enter a Zero-filled row Kajal Tahiliani, ICON Clinical Research, Warrington, PA Programmers often need to summarize data into tables as per template. But study
More informationGetting Up to Speed with PROC REPORT Kimberly LeBouton, K.J.L. Computing, Rossmoor, CA
SESUG 2012 Paper HW-01 Getting Up to Speed with PROC REPORT Kimberly LeBouton, K.J.L. Computing, Rossmoor, CA ABSTRACT Learning the basics of PROC REPORT can help the new SAS user avoid hours of headaches.
More informationPC and Windows Installation 32 and 64 bit Operating Systems
SUDAAN Installation Guide PC and Windows Installation 32 and 64 bit Operating Systems Release 11.0.1 Copyright 2013 by RTI International P.O. Box 12194 Research Triangle Park, NC 27709 All rights reserved.
More informationCluster Randomization Create Cluster Means Dataset
Chapter 270 Cluster Randomization Create Cluster Means Dataset Introduction A cluster randomization trial occurs when whole groups or clusters of individuals are treated together. Examples of such clusters
More informationCS15100 Lab 7: File compression
C151 Lab 7: File compression Fall 26 November 14, 26 Complete the first 3 chapters (through the build-huffman-tree function) in lab (optionally) with a partner. The rest you must do by yourself. Write
More informationNumeric Precision 101
www.sas.com > Service and Support > Technical Support TS Home Intro to Services News and Info Contact TS Site Map FAQ Feedback TS-654 Numeric Precision 101 This paper is intended as a basic introduction
More informationDefining Test Data Using Population Analysis Clarence Wm. Jackson, CQA - City of Dallas CIS
Defining Test Data Using Population Analysis Clarence Wm. Jackson, CQA - City of Dallas CIS Abstract Defining test data that provides complete test case coverage requires the tester to accumulate data
More informationGuidance for building Study and CRF in OpenClinica
Guidance for building Study and CRF in OpenClinica 1. Use of Patient Identifying information Patient Identifying Data (PID) is any data within clinical data that could potentially be used to identify subjects,
More informationLecture 1 Getting Started with SAS
SAS for Data Management, Analysis, and Reporting Lecture 1 Getting Started with SAS Portions reproduced with permission of SAS Institute Inc., Cary, NC, USA Goals of the course To provide skills required
More informationGary L. Katsanis, Blue Cross and Blue Shield of the Rochester Area, Rochester, NY
Table Lookups in the SAS Data Step Gary L. Katsanis, Blue Cross and Blue Shield of the Rochester Area, Rochester, NY Introduction - What is a Table Lookup? You have a sales file with one observation for
More informationInput Space Partitioning
CMPT 473 Software Quality Assurance Input Space Partitioning Nick Sumner Recall Testing involves running software and comparing observed behavior against expected behavior Select an input, look at the
More informationPaper DB2 table. For a simple read of a table, SQL and DATA step operate with similar efficiency.
Paper 76-28 Comparative Efficiency of SQL and Base Code When Reading from Database Tables and Existing Data Sets Steven Feder, Federal Reserve Board, Washington, D.C. ABSTRACT In this paper we compare
More informationTape Drive Data Compression Q & A
Tape Drive Data Compression Q & A Question What is data compression and how does compression work? Data compression permits increased storage capacities by using a mathematical algorithm that reduces redundant
More informationDesign Issues 1 / 36. Local versus Global Allocation. Choosing
Design Issues 1 / 36 Local versus Global Allocation When process A has a page fault, where does the new page frame come from? More precisely, is one of A s pages reclaimed, or can a page frame be taken
More informationPaper An Automated Reporting Macro to Create Cell Index An Enhanced Revisit. Shi-Tao Yeh, GlaxoSmithKline, King of Prussia, PA
ABSTRACT Paper 236-28 An Automated Reporting Macro to Create Cell Index An Enhanced Revisit When generating tables from SAS PROC TABULATE or PROC REPORT to summarize data, sometimes it is necessary to
More informationMachine Architecture and Number Systems CMSC104. Von Neumann Machine. Major Computer Components. Schematic Diagram of a Computer. First Computer?
CMSC104 Lecture 2 Remember to report to the lab on Wednesday Topics Machine Architecture and Number Systems Major Computer Components Bits, Bytes, and Words The Decimal Number System The Binary Number
More informationIndex Compression. David Kauchak cs160 Fall 2009 adapted from:
Index Compression David Kauchak cs160 Fall 2009 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture5-indexcompression.ppt Administrative Homework 2 Assignment 1 Assignment 2 Pair programming?
More informationIntegrated Safety Reporting Anemone Thalmann elba - GEIGY Ltd (PH3.25), Basel
ntegrated Safety Reporting Anemone Thalmann elba - GEGY Ltd (PH3.25), Basel Abstract: Most of the regulatory health authorities approving pharmaceutical products consider the ntegrated Safety Summary to
More informationImproving Productivity with Parameters
Improving Productivity with Parameters Michael Trull Rocky Brown Thursday, January 25, 2007 Improving Productivity with Parameters Part I The Fundamentals Parameters are variables which define the size
More informationTen Great Reasons to Learn SAS Software's SQL Procedure
Ten Great Reasons to Learn SAS Software's SQL Procedure Kirk Paul Lafler, Software Intelligence Corporation ABSTRACT The SQL Procedure has so many great features for both end-users and programmers. It's
More informationData Compression in Blackbaud CRM Databases
Data Compression in Blackbaud CRM Databases Len Wyatt Enterprise Performance Team Executive Summary... 1 Compression in SQL Server... 2 Perform Compression in Blackbaud CRM Databases... 3 Initial Compression...
More informationGetting Information from a Table
ch02.fm Page 45 Wednesday, April 14, 1999 2:44 PM Chapter 2 Getting Information from a Table This chapter explains the basic technique of getting the information you want from a table when you do not want
More informationTweaking your tables: Suppressing superfluous subtotals in PROC TABULATE
ABSTRACT Tweaking your tables: Suppressing superfluous subtotals in PROC TABULATE Steve Cavill, NSW Bureau of Crime Statistics and Research, Sydney, Australia PROC TABULATE is a great tool for generating
More informationSAS Visual Analytics Environment Stood Up? Check! Data Automatically Loaded and Refreshed? Not Quite
Paper SAS1952-2015 SAS Visual Analytics Environment Stood Up? Check! Data Automatically Loaded and Refreshed? Not Quite Jason Shoffner, SAS Institute Inc., Cary, NC ABSTRACT Once you have a SAS Visual
More informationUsing Images in FF&EZ within a Citrix Environment
1 Using Images in FF&EZ within a Citrix Environment This document explains how to add images to specifications, and covers the situation where the FF&E database is on a remote server instead of your local
More informationBinary, Hexadecimal and Octal number system
Binary, Hexadecimal and Octal number system Binary, hexadecimal, and octal refer to different number systems. The one that we typically use is called decimal. These number systems refer to the number of
More informationChapter 3 Data Representation
Chapter 3 Data Representation The focus of this chapter is the representation of data in a digital computer. We begin with a review of several number systems (decimal, binary, octal, and hexadecimal) and
More informationHow to Incorporate Old SAS Data into a New DATA Step, or What is S-M-U?
Paper 54-25 How to Incorporate Old SAS Data into a New DATA Step, or What is S-M-U? Andrew T. Kuligowski Nielsen Media Research Abstract / Introduction S-M-U. Some people will see these three letters and
More information