Will Your Data Warehouse Stand the Test of Time? David Annis, Amadeus Data Processing, Germany
As storage becomes cheaper, we have to be more careful rather than less careful about how we design our historical databases. Whatever the style or architecture of your data, and whether or not it conforms to the standards for a data warehouse, we all benefit from the increased awareness amongst users and managers of 'Data Warehousing' as a concept, because as the demand for information delivery becomes greater, there is a good chance that you will be given enough disk space to keep several years' worth of data on-line, thus avoiding the administrative overhead of archiving and restoring old information. With this benefit, however, comes an added responsibility to design your databases, and the update procedures, in such a way that they will still be feasible in several years' time.

In this paper, I will examine two aspects of dealing with large amounts of historical data that will probably be common to all business areas:

1. The nightly update - getting yesterday's information into the warehouse.
2. Coding systems - storing and accessing coded information.

Amadeus

Amadeus is a global computerised reservation system that provides centralised access to travel services including airlines, cars and hotels. Within the Capacity and Performance Department, we have a responsibility to ensure that the systems and network resources are provided at optimum value for money to the customer. As a result, we also have to ensure that our own data analysis procedures do not use resources that would be better allocated to the main production systems.

The nightly update

The common objectives

- Format yesterday's business information for inclusion into the database.
- Ensure no duplication.
- Set up any indexes that are required.

Assumptions

When adding yesterday's data into your database, the objective is to make the process as automated and secure as possible.
Assuming that the procedures involved are not perfect, you will need to consider the possibility that the job will have to be re-run, and that any data already added will have to be replaced. You also need to design the database so that frequent queries have a good response time; typically this means indexing. It is probably safe to assume (or even enforce) that if a user makes a query against a historical database, he/she will be interested in a specific time period.

As an example, consider the work flow involved in updating the Amadeus network performance database.
Example Work flow

[Figure: network statistics for four resource types (SNA, IDG, BBN, HYP) are each converted to a common format in SAS datasets, combined, and then added to the database.]

This represents a fairly common type of work flow: different sources of business information being combined and perhaps summarised before being added into the main database(s). In our case, all the automated checks for data validity cannot guarantee that a quick read of the daily report will not show that the process needs to be restarted from an earlier step. As a result, the process of adding to the database needs to ensure that yesterday's data is not simply duplicated.

Example Methods

The following example methods are extracts of code which could be used to add a daily dataset (TODAY) to a historical dataset (MAIN.CUSTOMER).
Method 1: Using SET.

%let date=today()-1;
data MAIN.CUSTOMER;
  set MAIN.CUSTOMER(where=(date ne &date))
      TODAY(where=(date eq &date));

Consequences

- If any previous data from &DATE was added, it will be replaced.
- If the newly created data TODAY contains any data from another day, this will not be added.

Method 2: Using MERGE. The example assumes one observation per customer per day.

data MAIN.CUSTOMER;
  merge MAIN.CUSTOMER TODAY;
  by date customer;

Consequences

If the process is re-run, and the number of customers is different on the second run, some observations from the first run will not be replaced.

Disadvantage of Methods 1 and 2.

Although the consequences of methods 1 and 2 can be avoided with a little extra code, I am not going to expand on them, as they both have one major disadvantage: both methods perform a sequential read of the main dataset. This means that the resources used by the nightly update would grow in a linear way over time.

Method 3: Using a unique index.

data TODAY(index=(key=(date customer)/unique));
  <more statements>

proc append base=main.customer new=today;

Consequences

You could describe this method as an 'intelligent' version of Method 2 above. Note that in this method also, if the process is re-run, and the number of customers is different on the second run, some observations from the first run will not be replaced. This could be rectified using the MODIFY statement, and removing the observations in place, but this is only advisable if your database is backed up, as a system error in the middle of the data step could destroy your data. This method does have the distinct advantage, however, that the data will be indexed in a useful way. It is useful because almost all queries against a historical database will be subsetting by the date in some way or other.
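As an aside, the effect of a unique key on the append step can be pictured outside SAS. The following Python fragment is a sketch only, not part of the original implementation; the function name append_unique is invented. It shows the behaviour the unique (date, customer) index buys you: a re-run cannot create duplicate rows, because duplicate keys are rejected rather than added twice.

```python
# Illustrative sketch only (Python, not SAS): appending with a unique
# (date, customer) key - duplicate keys from a re-run are rejected.

def append_unique(base, today):
    """base: dict keyed by (date, customer); today: list of row dicts."""
    rejected = []
    for row in today:
        key = (row["date"], row["customer"])
        if key in base:
            rejected.append(row)   # an indexed append would refuse these rows
        else:
            base[key] = row
    return rejected

base = {}
day = [{"date": 1, "customer": "A", "calls": 3}]
assert append_unique(base, day) == []    # first load: all rows accepted
assert append_unique(base, day) == day   # re-run: duplicate keys rejected
```

Note that, as in the SAS case, rejection alone does not replace the data from the first run; that still requires deleting the old rows explicitly.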
The other consequences of this method are overheads in disk space and run-time resources. This will be discussed in a later section.

Method 4: Using a Journal

When designing the Amadeus Network Performance Database, the major objective was the automation of the daily update, and ensuring that re-runs and exceptions could be handled simply. As a result, the first approach was to maintain a dataset containing all the dates where an update was performed - in other words, a simple journal dataset. This had the following benefits:

- Easy to avoid duplication of data.
- The update can be designed to run as many days as necessary to bring the database up to date.

On further examination, it became obvious that this journal dataset could also hold pointers (observation numbers) to the start and end of each day's data in the main database. The number of observations is also held in the journal dataset as a check, and the final structure is presented below.

[Figure: the journal dataset holds one observation per day, with variables DATE, SOBS, EOBS and NOBS; SOBS and EOBS point to the first and last observation of that day's data in the main dataset.]

This data structure is maintained by the macro %JAPPEND. This in turn uses two further macros which are described briefly below:

%nobs finds the number of observations in a dataset and puts the result into a global macro variable.

%jgetdate finds the date from a dataset. This works by searching from the middle of the dataset for the first observation with a non-missing date value. The macro is designed in this way mainly for performance data, where the records are likely to be sorted by time, with some data from just before or just after midnight at the beginning or end of the dataset. Note that in our example, this data is discarded.

For the sake of readability, I have removed the error handling code from the example overleaf.
* Macro JAPPEND is intended to append a day's worth of data to a history
  database. The following conditions are required:
  1. BASE and NEW datasets must have the same variables etc., as the
     append step does not use the FORCE option.
  2. A journal dataset must exist with the following variables:
     DATE, SOBS, EOBS, NOBS.
  3. The NEW dataset must contain a date or datetime variable.

  EXAMPLE CALL:
  %jappend(base=base,new=new,dateval=%str(datepart(dt)),journal=journal);
;

%macro jappend(base=,new=,journal=,dateval=);

%jgetdate(data=&new,dateval=&dateval,result=newdate)
%if &newdate=. %then ... ERROR ... ;

*********************************************************************;
* Check in journal for same date.                                    ;
* The delete is done after the append, using the _DELETE_ dataset    ;
* and the DELOBS macro variable.                                     ;
*********************************************************************;
data _delete_;
  set &journal;
  where date=&newdate;

%nobs(data=_delete_,macvar=delobs);

*********************************************************************;
* Find the number of observations before and after the append        ;
*********************************************************************;
%nobs(data=&base,macvar=baseobs);

proc append base=&base new=&new(where=(&dateval=&newdate));

%nobs(data=&base,macvar=newobs);

data _newobs_;
  date=&newdate;
  sobs=&baseobs+1;
  eobs=&newobs;
  nobs=&newobs-&baseobs;
  if nobs=0 then ... ERROR ... ;

* If date already on the database, delete the data in place *********;
%if &delobs>0 %then %do;

  data &base;
    set _delete_;
    do _obno_=sobs to eobs;
      modify &base point=_obno_;
      remove;
    end;

  **** Mark obs as deleted in journal before adding new entry;
  data &journal;
    set &journal;
    if date=&newdate then date=.;

%end;

**** Add new journal entry ******************************************;
proc append force base=&journal new=_newobs_;

%mend jappend;
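The journal bookkeeping itself is language-neutral. The following Python fragment is an illustrative sketch only, not the paper's SAS; the class name JournalledTable and its methods are invented. It captures the two essentials of %JAPPEND: the append records (start, end, count) pointers per day, and a re-run for a date that is already on the database supersedes the earlier range instead of duplicating it.

```python
# Illustrative sketch only (Python, not the paper's SAS): a main table plus
# a journal of (sobs, eobs, nobs) pointers per day.

class JournalledTable:
    def __init__(self):
        self.rows = []      # main dataset: tuples whose first field is the date
        self.journal = {}   # date -> (sobs, eobs, nobs), 1-based inclusive

    def jappend(self, date, day_rows):
        """Append one day's data; re-running the same date replaces it."""
        day_rows = [r for r in day_rows if r[0] == date]  # where=(date=&newdate)
        old = self.journal.pop(date, None)                # date already loaded?
        sobs = len(self.rows) + 1
        self.rows.extend(day_rows)                        # the append step
        eobs = len(self.rows)
        self.journal[date] = (sobs, eobs, eobs - sobs + 1)
        if old is not None:                               # delete old range in place
            s, e, _ = old
            for i in range(s - 1, e):
                self.rows[i] = None

    def jget(self, sdate, edate):
        """Direct-access read of a date window via the journal pointers."""
        out = []
        for date in sorted(d for d in self.journal if sdate <= d <= edate):
            s, e, _ = self.journal[date]
            out.extend(r for r in self.rows[s - 1:e] if r is not None)
        return out
```

The point of the structure is that neither the append nor a re-run ever re-reads the whole main table: only the journal and the affected day's range are touched.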
Queries Using a Journal

Of course, one disadvantage of developing a non-standard indexing scheme is that the index is not automatically used. The following macro, however, does provide a simple way of making use of the index. Note that I am presenting here a simplified version. The production version of this macro also contains code for the creation of DATA step views, and for the optimisation of subsetting by variables other than the date. The optimisation is achieved by performing a binary search, as it is known that the daily data is sorted before being JAPPENDed to the database.

%macro JGET(data=,out=,journal=,dateval=,sdate=,edate=);

data &out(drop=sobs eobs nobs);
  set &journal(where=(&dateval between &sdate and &edate));
  do _obs_=sobs to eobs;
    set &data point=_obs_;
    output;
  end;

%mend JGET;

Resources

One of the main reasons for developing the Journal Index method described above was the intuition that this would save considerable resources when compared to a unique index. Therefore we created some simulations to compare the disk utilisation and run-time utilisation of the two methods. The figures given below are intended as a qualitative comparison of the two methods. The intention is to simulate how the methods would behave over a long period of time. Therefore I have tended to simplify the graphs and avoid using too many figures, as the emphasis is to show the relative trend of the resources used rather than the actual numerical details.

Disk Space

As you might expect, tests indicated that, when using a unique index, the size of the index file was dependent only on the number of observations and the number of indexes. As a result, the overhead in percentage terms decreases as the number of non-indexed variables increases. A graph showing this relationship is shown below: all variables are 8 bytes, and there are 10,000 observations in the sample dataset.
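The retrieval side can also be pictured outside SAS. The following Python fragment is a sketch only, not the paper's %JGET; the function name jget and the sample data are invented. It shows the essential trick: the journal is scanned for the requested date window, and only the pointed-to ranges of the main table are read, in the manner of POINT= direct access.

```python
# Illustrative sketch only (Python, not the paper's SAS %JGET): a date-window
# query touches only the rows of the selected days, never the whole table.

def jget(main, journal, sdate, edate):
    """main: list of rows; journal: (date, sobs, eobs), 1-based inclusive."""
    out = []
    for date, sobs, eobs in journal:
        if sdate <= date <= edate:           # where=(date between &sdate and &edate)
            out.extend(main[sobs - 1:eobs])  # POINT=-style direct access
    return out

main = [(5, 10), (5, 12), (6, 9), (7, 11)]   # (day, value)
journal = [(5, 1, 2), (6, 3, 3), (7, 4, 4)]
assert jget(main, journal, 5, 6) == [(5, 10), (5, 12), (6, 9)]
```

Because the journal is small, the cost of a query grows with the size of the requested window, not with the size of the whole database.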
Note that when using a non-unique index, another influencing factor is the number of observations per group; but with 10,000 observations, the percentage overhead was similar to that of the unique index for all levels of grouping.
[Figure: disk overhead of a unique index on two 8-byte variables, expressed as a percentage, falling steeply as the number of other variables increases.]

By comparison, the disk space overhead for the Journal Index is virtually zero, so the graph effectively shows the percentage savings of the Journal Index over a standard SAS index.

Execution Resources

In order to compare the execution resource overhead of using a unique index, we ran some simulations in which we created a daily dataset, containing a date plus one other numeric variable, which both went to make up the unique index. We appended this repeatedly to a main dataset, changing the date each time. In the first test, we appended 10,000 observations per day, and repeated this 100 times, so that the main dataset contained 1 million observations by the end of the test. For each iteration, we measured the CPU and I/O utilisation of the append step. The objective was to show whether the resources consumed by the append increased with the size of the base dataset. As a comparison, we did the same test using a unique index, and using a journal. A representation of the results of the CPU measurements is shown overleaf.
[Figure: CPU utilisation comparison, append step only, for the journal and unique index methods, extrapolated over four years.]

In this representation I have extrapolated the results linearly to show the relative performance over a 4 year period. One surprising thing to note is that the CPU resources used by the unique index are lower until about year 3 in our example (about 11 million observations). This is in fact NOT due to the overhead of creating the journal, as this is negligible: PROC APPEND itself appears to use less CPU with a unique index than with no index at all. Adding more non-indexed variables into the example has the effect of moving the baseline for both techniques, but the convergence point remains the same.

In the same example, adding in the resources used by the creation step gives the following results.

[Figure: CPU utilisation comparison, create and append steps combined, for the journal and unique index methods.]

In this case the convergence point is after 1.7 years (6.2 million observations).
And finally, the I/O utilisation patterns.

[Figure: I/O utilisation comparison, create and append steps combined, for the journal and unique index methods over four years.]

In the case of I/O, the unique index consumed more I/O resources from day 1, and grew slowly over time. The growth rate in our example extrapolated to approximately 25% over three years.

Resources Used by Non-Unique Indexes

Note that I have not shown any performance comparisons for non-unique indexes, because they can be directly compared to one of the other two methods:

- Disk space: comparable to a unique index.
- I/O resources: comparable to a unique index.
- CPU resources: comparable to the unique index in the creation step, and to a journal index in the append step.

Conclusions

For the Amadeus Network Performance Database, the journal index method has distinct advantages for the following reasons:

- The number of variables in the database is small, so the overheads of an index would be high in percentage terms.
- The number of key variables required to make the index unique would be high.
- The database is most frequently used for regular reporting of all network resources within a given date period, so having the database indexed by date only is sufficient most of the time.
- For ad-hoc queries, the date is always used as a subsetting factor, and this makes the response time acceptable in most cases.
- Within a day's worth of data, the dataset is sorted by resource, allowing other optimisations within the query process (which I have not described here).

If, however, your data is not suited to this form of index, it should certainly be possible to achieve a low rate of growth for the update and query process by intelligent use of SAS indexes, or by direct access with the POINT= dataset option.
Coding Systems and Lookup Tables

Most of you have probably experienced some of the problems and frustrations that beset knowledge workers who are trying to maintain or query data going a long way back in time. One cause of this might be coding systems that have changed over time. As a designer of a data warehouse or historical database, it is wise to consider some additions to Murphy's law when applied to coding systems:

- Obsolete codes will be re-used to save introducing a new coding system.
- When the new coding system is introduced, no mapping will be possible to or from the old system.
- The date that they changed the coding system depends on whom you ask.
- The new codes look the same as the old codes; they just mean something different.

Some of these problems are not within the scope of this paper, and possibly not even within the capabilities of the author to answer. However, some can be alleviated by using date sensitive (or even time sensitive) lookups. This can be achieved using formats.

Example: Mapping Network Resources to Customers

Within Amadeus, network resources are frequently reviewed and modified to provide optimum bandwidth to a customer. In order to produce management reports showing the utilisation of the network by customer, we ideally need to map the line identifier to the customer for any 15 minute interval. Considering that we normally learn of changes after the event, if we were to store the customer as part of the database, we would have to re-process the daily data whenever a change was made. Therefore we maintain a table that looks as follows:

Resource Type  Resource ID  Effective Date  Customer
NPSI           X            JUL94:00:00     Lufthansa
NPSI           X            NOV94:15:15     Air France
NPSI           X            SEP94:00:45     Iberia
NPSI           X            NOV94:15:15     Lufthansa
NPSI           X            MAR95:12:00     SAS
NPSI           X            NOV94:15:15     Iberia
NPSI           X            OCT94:18:00     Air France
NPSI           X            NOV94:15:15     SAS

This table is then processed to create a dataset for processing by PROC FORMAT using the CNTLIN option.
In this example, a character format called NETUSER is created.
data formats;
  length start end $40 label $40;
  keep fmtname type start end label eexcl sexcl hlo;
  set data.netspeed end=last;
  by restype resid;

  /* Create enddate from next observation */
  if last.resid then enddate=.;
  else do;
    next=_n_+1;
    set data.netspeed(keep=effdate rename=(effdate=enddate)) point=next;
  end;

  /* Join the code and the date to make the lookup */
  start=restype||resid||put(effdate,z12.);
  end=restype||resid||put(enddate,z12.);

  /* Other variables used by PROC FORMAT */
  fmtname='netuser';
  type='c';
  eexcl='y'; sexcl='n';   /* Include start, exclude end */
  label=user;
  output;

  /* Default label */
  if last then do;
    start='**other**';
    end='**other**';
    hlo='o';
    label=' ';
    output;
  end;
return;

Once the format is compiled, the statement to find the customer based on the resource and the datetime becomes very simple:

user=put(restype||resid||put(datetime,z12.),$netuser.);

One important thing to note is that the codes (in our case RESTYPE and RESID) should have the same format and length in the lookup table and in the database.

Advantages

By making your lookup tables date sensitive, it is possible to avoid many of the problems of changing coding systems. At the design stage, it is safest to assume that all codes and coding systems may change within the lifetime of your data warehouse. Of course, the length of the code may also change, so it is wise to leave some free space within the variable to account for this.

If (or when!) codes change, simply add these codes into your lookup table with a new effective date, run the format compile as shown above, and the change is reflected automatically from the new date, with no change necessary in any programs.
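The rule the NETUSER format encodes - take the mapping whose effective date is the latest one not after the query time - can be pictured outside SAS as well. The following Python fragment is a sketch only, not the paper's PROC FORMAT code; the resource name "X25.1" and the helper names are invented for the illustration.

```python
# Illustrative sketch only (Python, not SAS): a date-sensitive lookup -
# each (resource, effective time) maps to a customer, and a query finds
# the mapping in force at a given time.

import bisect

def build_lookup(table):
    """table: list of (resource, effective_time, customer), any order."""
    lookup = {}
    for res, eff, cust in sorted(table):
        effs, custs = lookup.setdefault(res, ([], []))
        effs.append(eff)
        custs.append(cust)
    return lookup

def customer_at(lookup, resource, t):
    """Return the customer effective at time t, or '**other**' if none."""
    if resource not in lookup:
        return "**other**"
    effs, custs = lookup[resource]
    i = bisect.bisect_right(effs, t) - 1   # last effective time <= t
    return custs[i] if i >= 0 else "**other**"

table = [("X25.1", 100, "Lufthansa"), ("X25.1", 200, "Air France")]
lk = build_lookup(table)
assert customer_at(lk, "X25.1", 150) == "Lufthansa"
assert customer_at(lk, "X25.1", 250) == "Air France"
assert customer_at(lk, "X25.1", 50) == "**other**"
```

As with the format, adding a new effective-dated row changes the answer only from that date onward, with no change to the query code.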
Summary

As a data warehouse or performance database designer two years ago, you were probably restricted in disk space to about two years' worth of data. Now it appears that the demand for historical information is growing at the same pace as the data itself. As designers today, therefore, it is our responsibility to ensure that the processes for updating and retrieving the information will not be the limiting factor in supplying this demand. In short, we have to ensure that our data warehouses will stand the test of time.

SAS is a registered trademark of SAS Institute Inc., Cary, NC, USA

Acknowledgements: The idea for the journal index was inspired by a conversation with my colleague Sean Chaffee.
More informationThe Google File System
The Google File System Sanjay Ghemawat, Howard Gobioff and Shun Tak Leung Google* Shivesh Kumar Sharma fl4164@wayne.edu Fall 2015 004395771 Overview Google file system is a scalable distributed file system
More informationHP Dynamic Deduplication achieving a 50:1 ratio
HP Dynamic Deduplication achieving a 50:1 ratio Table of contents Introduction... 2 Data deduplication the hottest topic in data protection... 2 The benefits of data deduplication... 2 How does data deduplication
More informationHow TokuDB Fractal TreeTM. Indexes Work. Bradley C. Kuszmaul. Guest Lecture in MIT Performance Engineering, 18 November 2010.
6.172 How Fractal Trees Work 1 How TokuDB Fractal TreeTM Indexes Work Bradley C. Kuszmaul Guest Lecture in MIT 6.172 Performance Engineering, 18 November 2010. 6.172 How Fractal Trees Work 2 I m an MIT
More informationBinary Encoded Attribute-Pairing Technique for Database Compression
Binary Encoded Attribute-Pairing Technique for Database Compression Akanksha Baid and Swetha Krishnan Computer Sciences Department University of Wisconsin, Madison baid,swetha@cs.wisc.edu Abstract Data
More information6. Results. This section describes the performance that was achieved using the RAMA file system.
6. Results This section describes the performance that was achieved using the RAMA file system. The resulting numbers represent actual file data bytes transferred to/from server disks per second, excluding
More informationPhysical Level of Databases: B+-Trees
Physical Level of Databases: B+-Trees Adnan YAZICI Computer Engineering Department METU (Fall 2005) 1 B + -Tree Index Files l Disadvantage of indexed-sequential files: performance degrades as file grows,
More informationPaper # Jazz it up a Little with Formats. Brian Bee, The Knowledge Warehouse Ltd
Paper #1495-2014 Jazz it up a Little with Formats Brian Bee, The Knowledge Warehouse Ltd Abstract Formats are an often under-valued tool in the SAS toolbox. They can be used in just about all domains to
More informationDatabase System Concepts
Chapter 13: Query Processing s Departamento de Engenharia Informática Instituto Superior Técnico 1 st Semester 2008/2009 Slides (fortemente) baseados nos slides oficiais do livro c Silberschatz, Korth
More informationCreate a Format from a SAS Data Set Ruth Marisol Rivera, i3 Statprobe, Mexico City, Mexico
PharmaSUG 2011 - Paper TT02 Create a Format from a SAS Data Set Ruth Marisol Rivera, i3 Statprobe, Mexico City, Mexico ABSTRACT Many times we have to apply formats and it could be hard to create them specially
More informationHash-Based Indexing 165
Hash-Based Indexing 165 h 1 h 0 h 1 h 0 Next = 0 000 00 64 32 8 16 000 00 64 32 8 16 A 001 01 9 25 41 73 001 01 9 25 41 73 B 010 10 10 18 34 66 010 10 10 18 34 66 C Next = 3 011 11 11 19 D 011 11 11 19
More informationAll About SAS Dates. Marje Fecht Senior Partner, Prowerk Consulting. Copyright 2017 Prowerk Consulting
All About SAS Dates Marje Fecht Senior Partner, Prowerk Consulting Copyright 2017 Prowerk Consulting 1 SAS Dates What IS a SAS Date? And Why?? My data aren t stored as SAS Dates How can I convert How can
More informationTOP 10 (OR MORE) WAYS TO OPTIMIZE YOUR SAS CODE
TOP 10 (OR MORE) WAYS TO OPTIMIZE YOUR SAS CODE Handy Tips for the Savvy Programmer SAS PROGRAMMING BEST PRACTICES Create Readable Code Basic Coding Recommendations» Efficiently choosing data for processing»
More informationChapter 8 & Chapter 9 Main Memory & Virtual Memory
Chapter 8 & Chapter 9 Main Memory & Virtual Memory 1. Various ways of organizing memory hardware. 2. Memory-management techniques: 1. Paging 2. Segmentation. Introduction Memory consists of a large array
More informationBatch Jobs Performance Testing
Batch Jobs Performance Testing October 20, 2012 Author Rajesh Kurapati Introduction Batch Job A batch job is a scheduled program that runs without user intervention. Corporations use batch jobs to automate
More informationDDS Dynamic Search Trees
DDS Dynamic Search Trees 1 Data structures l A data structure models some abstract object. It implements a number of operations on this object, which usually can be classified into l creation and deletion
More informationData Structure. IBPS SO (IT- Officer) Exam 2017
Data Structure IBPS SO (IT- Officer) Exam 2017 Data Structure: In computer science, a data structure is a way of storing and organizing data in a computer s memory so that it can be used efficiently. Data
More informationUniversity of Waterloo Midterm Examination Solution
University of Waterloo Midterm Examination Solution Winter, 2011 1. (6 total marks) The diagram below shows an extensible hash table with four hash buckets. Each number x in the buckets represents an entry
More informationTackling Unique Problems Using TWO SET Statements in ONE DATA Step. Ben Cochran, The Bedford Group, Raleigh, NC
MWSUG 2017 - Paper BB114 Tackling Unique Problems Using TWO SET Statements in ONE DATA Step Ben Cochran, The Bedford Group, Raleigh, NC ABSTRACT This paper illustrates solving many problems by creatively
More informationShort Note. The unwritten computing rules at SEP. Alexander M. Popovici, Dave Nichols and Dimitri Bevc 1 INTRODUCTION
Stanford Exploration Project, Report 80, May 15, 2001, pages 1?? Short Note The unwritten computing rules at SEP Alexander M. Popovici, Dave Nichols and Dimitri Bevc 1 INTRODUCTION This short note is intended
More informationBig Data Analytics CSCI 4030
High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Queries on streams
More informationInformation Lifecycle Management for Business Data. An Oracle White Paper September 2005
Information Lifecycle Management for Business Data An Oracle White Paper September 2005 Information Lifecycle Management for Business Data Introduction... 3 Regulatory Requirements... 3 What is ILM?...
More informationChecking for Duplicates Wendi L. Wright
Checking for Duplicates Wendi L. Wright ABSTRACT This introductory level paper demonstrates a quick way to find duplicates in a dataset (with both simple and complex keys). It discusses what to do when
More informationData Vault Partitioning Strategies WHITE PAPER
Dani Schnider Data Vault ing Strategies WHITE PAPER Page 1 of 18 www.trivadis.com Date 09.02.2018 CONTENTS 1 Introduction... 3 2 Data Vault Modeling... 4 2.1 What is Data Vault Modeling? 4 2.2 Hubs, Links
More information. NO MORE MERGE - Alternative Table Lookup Techniques Dana Rafiee, Destiny Corporation/DDISC Group Ltd. U.S., Wethersfield, CT
betfomilw tltlljri4ls. NO MORE MERGE - Alternative Table Lookup Techniques Dana Rafiee, Destiny Corporation/DDISC Group Ltd. U.S., Wethersfield, CT ABSTRACT This tutorial is designed to show you several
More informationFrom Manual to Automatic with Overdrive - Using SAS to Automate Report Generation Faron Kincheloe, Baylor University, Waco, TX
Paper 152-27 From Manual to Automatic with Overdrive - Using SAS to Automate Report Generation Faron Kincheloe, Baylor University, Waco, TX ABSTRACT This paper is a case study of how SAS products were
More informationECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation
ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation Weiping Liao, Saengrawee (Anne) Pratoomtong, and Chuan Zhang Abstract Binary translation is an important component for translating
More informationHow to Incorporate Old SAS Data into a New DATA Step, or What is S-M-U?
How to Incorporate Old SAS Data into a New DATA Step, or What is S-M-U? Andrew T. Kuligowski Nielsen Media Research Abstract / Introduction S-M-U. Some people will see these three letters and immediately
More informationSAS File Management. Improving Performance CHAPTER 37
519 CHAPTER 37 SAS File Management Improving Performance 519 Moving SAS Files Between Operating Environments 520 Converting SAS Files 520 Repairing Damaged Files 520 Recovering SAS Data Files 521 Recovering
More informationBase and Advance SAS
Base and Advance SAS BASE SAS INTRODUCTION An Overview of the SAS System SAS Tasks Output produced by the SAS System SAS Tools (SAS Program - Data step and Proc step) A sample SAS program Exploring SAS
More informationIntro to DB CHAPTER 12 INDEXING & HASHING
Intro to DB CHAPTER 12 INDEXING & HASHING Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B+-Tree Index Files B-Tree Index Files Static Hashing Dynamic Hashing Comparison of Ordered Indexing
More informationData Warehousing. New Features in SAS/Warehouse Administrator Ken Wright, SAS Institute Inc., Cary, NC. Paper
Paper 114-25 New Features in SAS/Warehouse Administrator Ken Wright, SAS Institute Inc., Cary, NC ABSTRACT SAS/Warehouse Administrator 2.0 introduces several powerful new features to assist in your data
More informationChapter 12: Query Processing
Chapter 12: Query Processing Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Overview Chapter 12: Query Processing Measures of Query Cost Selection Operation Sorting Join
More informationCSE 530A. B+ Trees. Washington University Fall 2013
CSE 530A B+ Trees Washington University Fall 2013 B Trees A B tree is an ordered (non-binary) tree where the internal nodes can have a varying number of child nodes (within some range) B Trees When a key
More informationCS5412 CLOUD COMPUTING: PRELIM EXAM Open book, open notes. 90 minutes plus 45 minutes grace period, hence 2h 15m maximum working time.
CS5412 CLOUD COMPUTING: PRELIM EXAM Open book, open notes. 90 minutes plus 45 minutes grace period, hence 2h 15m maximum working time. SOLUTION SET In class we often used smart highway (SH) systems as
More informationVersion 8 Base SAS Performance: How Does It Stack-Up? Robert Ray, SAS Institute Inc, Cary, NC
Paper 9-25 Version 8 Base SAS Performance: How Does It Stack-Up? Robert Ray, SAS Institute Inc, Cary, NC ABSTRACT This paper presents the results of a study conducted at SAS Institute Inc to compare the
More informationClustering and Reclustering HEP Data in Object Databases
Clustering and Reclustering HEP Data in Object Databases Koen Holtman CERN EP division CH - Geneva 3, Switzerland We formulate principles for the clustering of data, applicable to both sequential HEP applications
More informationLecture 13. Lecture 13: B+ Tree
Lecture 13 Lecture 13: B+ Tree Lecture 13 Announcements 1. Project Part 2 extension till Friday 2. Project Part 3: B+ Tree coming out Friday 3. Poll for Nov 22nd 4. Exam Pickup: If you have questions,
More informationSummarizing Impossibly Large SAS Data Sets For the Data Warehouse Server Using Horizontal Summarization
Summarizing Impossibly Large SAS Data Sets For the Data Warehouse Server Using Horizontal Summarization Michael A. Raithel, Raithel Consulting Services Abstract Data warehouse applications thrive on pre-summarized
More informationBest Practice for Creation and Maintenance of a SAS Infrastructure
Paper 2501-2015 Best Practice for Creation and Maintenance of a SAS Infrastructure Paul Thomas, ASUP Ltd. ABSTRACT The advantage of using metadata to control and maintain data and access to data on databases,
More informationCAPACITY PLANNING FOR THE DATA WAREHOUSE BY W. H. Inmon
CAPACITY PLANNING FOR THE DATA WAREHOUSE BY W. H. Inmon The data warehouse environment - like all other computer environments - requires hardware resources. Given the volume of data and the type of processing
More informationGuide Users along Information Pathways and Surf through the Data
Guide Users along Information Pathways and Surf through the Data Stephen Overton, Overton Technologies, LLC, Raleigh, NC ABSTRACT Business information can be consumed many ways using the SAS Enterprise
More informationVirtual Memory - Overview. Programmers View. Virtual Physical. Virtual Physical. Program has its own virtual memory space.
Virtual Memory - Overview Programmers View Process runs in virtual (logical) space may be larger than physical. Paging can implement virtual. Which pages to have in? How much to allow each process? Program
More informationIf You Need These OBS and These VARS, Then Drop IF, and Keep WHERE Jay Iyengar, Data Systems Consultants LLC
Paper 2417-2018 If You Need These OBS and These VARS, Then Drop IF, and Keep WHERE Jay Iyengar, Data Systems Consultants LLC ABSTRACT Reading data effectively in the DATA step requires knowing the implications
More informationAre Your SAS Programs Running You? Marje Fecht, Prowerk Consulting, Cape Coral, FL Larry Stewart, SAS Institute Inc., Cary, NC
Paper CS-044 Are Your SAS Programs Running You? Marje Fecht, Prowerk Consulting, Cape Coral, FL Larry Stewart, SAS Institute Inc., Cary, NC ABSTRACT Most programs are written on a tight schedule, using
More informationHandling Numeric Representation SAS Errors Caused by Simple Floating-Point Arithmetic Computation Fuad J. Foty, U.S. Census Bureau, Washington, DC
Paper BB-206 Handling Numeric Representation SAS Errors Caused by Simple Floating-Point Arithmetic Computation Fuad J. Foty, U.S. Census Bureau, Washington, DC ABSTRACT Every SAS programmer knows that
More informationIndexing: Overview & Hashing. CS 377: Database Systems
Indexing: Overview & Hashing CS 377: Database Systems Recap: Data Storage Data items Records Memory DBMS Blocks blocks Files Different ways to organize files for better performance Disk Motivation for
More informationStephen M. Beatrous, SAS Institute Inc., Cary, NC John T. Stokes, SAS Institute Inc., Austin, TX
1/0 Performance Improvements in Release 6.07 of the SAS System under MVS, ems, and VMS' Stephen M. Beatrous, SAS Institute Inc., Cary, NC John T. Stokes, SAS Institute Inc., Austin, TX INTRODUCTION The
More informationAre you Still Afraid of Using Arrays? Let s Explore their Advantages
Paper CT07 Are you Still Afraid of Using Arrays? Let s Explore their Advantages Vladyslav Khudov, Experis Clinical, Kharkiv, Ukraine ABSTRACT At first glance, arrays in SAS seem to be a complicated and
More informationTable Lookups: From IF-THEN to Key-Indexing
Table Lookups: From IF-THEN to Key-Indexing Arthur L. Carpenter, California Occidental Consultants ABSTRACT One of the more commonly needed operations within SAS programming is to determine the value of
More informationFile Management By : Kaushik Vaghani
File Management By : Kaushik Vaghani File Concept Access Methods File Types File Operations Directory Structure File-System Structure File Management Directory Implementation (Linear List, Hash Table)
More information