Will Your Data Warehouse Stand the Test of Time? David Annis, Amadeus Data Processing, Germany
As storage becomes cheaper, we have to be more careful rather than less careful about how we design our historical databases. Whatever the style or architecture of your data, and whether or not it conforms to the standards for a data warehouse, we all benefit from the increased awareness amongst users and managers of 'Data Warehousing' as a concept, because as the demand for information delivery becomes greater, there is a good chance that you will be given enough disk space to keep several years' worth of data on-line, thus avoiding the administrative overhead of archiving and restoring old information. With this benefit, however, comes an added responsibility to design your databases, and the update procedures, in such a way that they will still be feasible in several years' time.

In this paper, I will examine two aspects of dealing with large amounts of historical data that will probably be common to all business areas:

1. The nightly update - getting yesterday's information into the warehouse.
2. Coding systems - storing and accessing coded information.

Amadeus

Amadeus is a global computerised reservation system that provides centralised access to travel services including airlines, cars and hotels. Within the Capacity and Performance Department, we have a responsibility to ensure that the systems and network resources are provided at optimum value for money to the customer. As a result, we also have to ensure that our own data analysis procedures do not use resources that would be better allocated to the main production systems.

The nightly update

The common objectives

- Format yesterday's business information for inclusion into the database.
- Ensure no duplication.
- Set up any indexes that are required.

Assumptions

When adding yesterday's data into your database, the objective is to make the process as automated and secure as possible.
Assuming that the procedures involved are not perfect, you will need to consider the possibility that the job will have to be re-run, and that any data already added will have to be replaced. You also need to design the database so that frequent queries have a good response time; typically this means indexing. It is probably safe to assume (or even enforce) that if a user makes a query against a historical database, he/she will be interested in a specific time period.

As an example, consider the work flow involved in updating the Amadeus network performance database.
Example Work flow

[Figure: network statistics for four resource types (SNA, IDG, BBN, HYP) are each converted to a common format in SAS datasets, combined, and then added to the database.]

This represents a fairly common type of work flow: different sources of business information being combined and perhaps summarised before being added into the main database(s). In our case, all the automated checks for data validity cannot guarantee that a quick read of the daily report will not show that the process needs to be restarted from an earlier step. As a result, the process of adding to the database needs to ensure that yesterday's data is not simply duplicated.

Example Methods

The following example methods are extracts of code which could be used to add a daily dataset (TODAY) to a historical dataset (MAIN.CUSTOMER).
Method 1: Using SET.

%let date=today()-1;
data MAIN.CUSTOMER;
  set MAIN.CUSTOMER(where=(date ne &date))
      TODAY(where=(date eq &date));

Consequences

- If any previous data from &DATE was added, it will be replaced.
- If the newly created data TODAY contains any data from another day, this will not be added.

Method 2: Using MERGE. The example assumes one observation per customer per day.

data MAIN.CUSTOMER;
  merge MAIN.CUSTOMER TODAY;
  by date customer;

Consequences

If the process is re-run, and the number of customers is different on the second run, some observations from the first run will not be replaced.

Disadvantage of Methods 1 and 2.

Although the consequences of methods 1 and 2 can be avoided with a little extra code, I am not going to expand on them, as they both have one major disadvantage: both methods perform a sequential read of the main dataset. This means that the resources used by the nightly update would grow in a linear way over time.

Method 3: Using a unique index.

data TODAY(index=(key=(date customer)/unique));
  <more statements>

proc append base=main.customer new=today;

Consequences

You could describe this method as an 'intelligent' version of Method 2 above. Note that in this method also, if the process is re-run, and the number of customers is different on the second run, some observations from the first run will not be replaced. This could be rectified using the MODIFY statement, and removing the observations in place, but this is only advisable if your database is backed up, as a system error in the middle of the data step could destroy your data. This method does have the distinct advantage, however, that the data will be indexed in a useful way. It is useful because almost all queries against a historical database will be subsetting by the date in some way or other.
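As an aside, the effect of a unique key on the append step can be pictured outside SAS. The following Python fragment is a sketch only, not part of the original implementation; the function name append_unique is invented. It shows the behaviour the unique (date, customer) index buys you: a re-run cannot create duplicate rows, because duplicate keys are rejected rather than added twice.

```python
# Illustrative sketch only (Python, not SAS): appending with a unique
# (date, customer) key - duplicate keys from a re-run are rejected.

def append_unique(base, today):
    """base: dict keyed by (date, customer); today: list of row dicts."""
    rejected = []
    for row in today:
        key = (row["date"], row["customer"])
        if key in base:
            rejected.append(row)   # an indexed append would refuse these rows
        else:
            base[key] = row
    return rejected

base = {}
day = [{"date": 1, "customer": "A", "calls": 3}]
assert append_unique(base, day) == []    # first load: all rows accepted
assert append_unique(base, day) == day   # re-run: duplicate keys rejected
```

Note that, as in the SAS case, rejection alone does not replace the data from the first run; that still requires deleting the old rows explicitly.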
The other consequences of this method are overheads in disk space and run-time resources. This will be discussed in a later section.

Method 4: Using a Journal

When designing the Amadeus Network Performance Database, the major objective was the automation of the daily update, and ensuring that re-runs and exceptions could be handled simply. As a result, the first approach was to maintain a dataset containing all the dates where an update was performed - in other words, a simple journal dataset. This had the following benefits:

- Easy to avoid duplication of data.
- The update can be designed to run as many days as necessary to bring the database up to date.

On further examination, it became obvious that this journal dataset could also hold pointers (observation numbers) to the start and end of each day's data in the main database. The number of observations is also held in the journal dataset as a check, and the final structure is presented below.

[Figure: the journal dataset holds one observation per day, with variables DATE, SOBS, EOBS and NOBS; SOBS and EOBS point to the first and last observation of that day's data in the main dataset.]

This data structure is maintained by the macro %JAPPEND. This in turn uses two further macros which are described briefly below:

%nobs finds the number of observations in a dataset and puts the result into a global macro variable.

%jgetdate finds the date from a dataset. This works by searching from the middle of the dataset for the first observation with a non-missing date value. The macro is designed in this way mainly for performance data, where the records are likely to be sorted by time, with some data from just before or just after midnight at the beginning or end of the dataset. Note that in our example, this data is discarded.

For the sake of readability, I have removed the error handling code from the example overleaf.
* Macro JAPPEND is intended to append a day's worth of data to a history
  database. The following conditions are required:
  1. BASE and NEW datasets must have the same variables etc., as the
     append step does not use the FORCE option.
  2. A journal dataset must exist with the following variables:
     DATE, SOBS, EOBS, NOBS.
  3. The NEW dataset must contain a date or datetime variable.

  EXAMPLE CALL:
  %jappend(base=base,new=new,dateval=%str(datepart(dt)),journal=journal);
;

%macro jappend(base=,new=,journal=,dateval=);

%jgetdate(data=&new,dateval=&dateval,result=newdate)
%if &newdate=. %then ... ERROR ... ;

*********************************************************************;
* Check in journal for same date.                                    ;
* The delete is done after the append, using the _DELETE_ dataset    ;
* and the DELOBS macro variable.                                     ;
*********************************************************************;
data _delete_;
  set &journal;
  where date=&newdate;

%nobs(data=_delete_,macvar=delobs);

*********************************************************************;
* Find the number of observations before and after the append        ;
*********************************************************************;
%nobs(data=&base,macvar=baseobs);

proc append base=&base new=&new(where=(&dateval=&newdate));

%nobs(data=&base,macvar=newobs);

data _newobs_;
  date=&newdate;
  sobs=&baseobs+1;
  eobs=&newobs;
  nobs=&newobs-&baseobs;
  if nobs=0 then ... ERROR ... ;

* If date already on the database, delete the data in place *********;
%if &delobs>0 %then %do;

  data &base;
    set _delete_;
    do _obno_=sobs to eobs;
      modify &base point=_obno_;
      remove;
    end;

  **** Mark obs as deleted in journal before adding new entry;
  data &journal;
    set &journal;
    if date=&newdate then date=.;

%end;

**** Add new journal entry ******************************************;
proc append force base=&journal new=_newobs_;

%mend jappend;
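The journal bookkeeping itself is language-neutral. The following Python fragment is an illustrative sketch only, not the paper's SAS; the class name JournalledTable and its methods are invented. It captures the two essentials of %JAPPEND: the append records (start, end, count) pointers per day, and a re-run for a date that is already on the database supersedes the earlier range instead of duplicating it.

```python
# Illustrative sketch only (Python, not the paper's SAS): a main table plus
# a journal of (sobs, eobs, nobs) pointers per day.

class JournalledTable:
    def __init__(self):
        self.rows = []      # main dataset: tuples whose first field is the date
        self.journal = {}   # date -> (sobs, eobs, nobs), 1-based inclusive

    def jappend(self, date, day_rows):
        """Append one day's data; re-running the same date replaces it."""
        day_rows = [r for r in day_rows if r[0] == date]  # where=(date=&newdate)
        old = self.journal.pop(date, None)                # date already loaded?
        sobs = len(self.rows) + 1
        self.rows.extend(day_rows)                        # the append step
        eobs = len(self.rows)
        self.journal[date] = (sobs, eobs, eobs - sobs + 1)
        if old is not None:                               # delete old range in place
            s, e, _ = old
            for i in range(s - 1, e):
                self.rows[i] = None

    def jget(self, sdate, edate):
        """Direct-access read of a date window via the journal pointers."""
        out = []
        for date in sorted(d for d in self.journal if sdate <= d <= edate):
            s, e, _ = self.journal[date]
            out.extend(r for r in self.rows[s - 1:e] if r is not None)
        return out
```

The point of the structure is that neither the append nor a re-run ever re-reads the whole main table: only the journal and the affected day's range are touched.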
Queries Using a Journal

Of course, one disadvantage of developing a non-standard indexing scheme is that the index is not automatically used. The following macro, however, does provide a simple way of making use of the index. Note that I am presenting here a simplified version. The production version of this macro also contains code for the creation of DATA step views, and for the optimisation of subsetting by variables other than the date. The optimisation is achieved by performing a binary search, as it is known that the daily data is sorted before being JAPPENDed to the database.

%macro JGET(data=,out=,journal=,dateval=,sdate=,edate=);

data &out(drop=sobs eobs nobs);
  set &journal(where=(&dateval between &sdate and &edate));
  do _obs_=sobs to eobs;
    set &data point=_obs_;
    output;
  end;

%mend JGET;

Resources

One of the main reasons for developing the Journal Index method described above was the intuition that this would save considerable resources when compared to a unique index. Therefore we created some simulations to compare the disk utilisation and run-time utilisation of the two methods. The figures given below are intended as a qualitative comparison of the two methods. The intention is to simulate how the methods would behave over a long period of time. Therefore I have tended to simplify the graphs and avoid using too many figures, as the emphasis is to show the relative trend of the resources used rather than the actual numerical details.

Disk Space

As you might expect, tests indicated that, when using a unique index, the size of the index file was dependent only on the number of observations and the number of indexes. As a result, the overhead in percentage terms decreases as the number of non-indexed variables increases. A graph showing this relationship is shown below: all variables are 8 bytes, and there are 10,000 observations in the sample dataset.
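The retrieval side can also be pictured outside SAS. The following Python fragment is a sketch only, not the paper's %JGET; the function name jget and the sample data are invented. It shows the essential trick: the journal is scanned for the requested date window, and only the pointed-to ranges of the main table are read, in the manner of POINT= direct access.

```python
# Illustrative sketch only (Python, not the paper's SAS %JGET): a date-window
# query touches only the rows of the selected days, never the whole table.

def jget(main, journal, sdate, edate):
    """main: list of rows; journal: (date, sobs, eobs), 1-based inclusive."""
    out = []
    for date, sobs, eobs in journal:
        if sdate <= date <= edate:           # where=(date between &sdate and &edate)
            out.extend(main[sobs - 1:eobs])  # POINT=-style direct access
    return out

main = [(5, 10), (5, 12), (6, 9), (7, 11)]   # (day, value)
journal = [(5, 1, 2), (6, 3, 3), (7, 4, 4)]
assert jget(main, journal, 5, 6) == [(5, 10), (5, 12), (6, 9)]
```

Because the journal is small, the cost of a query grows with the size of the requested window, not with the size of the whole database.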
Note that when using a non-unique index, another influencing factor is the number of observations per group; but with 10,000 observations, the percentage overhead was similar to that of the unique index for all levels of grouping.
[Figure: disk overhead of a unique index on two 8-byte variables, expressed as a percentage, falling steeply as the number of other variables increases.]

By comparison, the disk space overhead for the Journal Index is virtually zero, so the graph effectively shows the percentage savings of the Journal Index over a standard SAS index.

Execution Resources

In order to compare the execution resource overhead of using a unique index, we ran some simulations in which we created a daily dataset, containing a date plus one other numeric variable, which both went to make up the unique index. We appended this repeatedly to a main dataset, changing the date each time. In the first test, we appended 10,000 observations per day, and repeated this 100 times, so that the main dataset contained 1 million observations by the end of the test. For each iteration, we measured the CPU and I/O utilisation of the append step. The objective was to show whether the resources consumed by the append increased with the size of the base dataset. As a comparison, we did the same test using a unique index, and using a journal. A representation of the results of the CPU measurements is shown overleaf.
[Figure: CPU utilisation comparison, append step only, for the journal and unique index methods, extrapolated over four years.]

In this representation I have extrapolated the results linearly to show the relative performance over a 4 year period. One surprising thing to note is that the CPU resources used by the unique index are lower until about year 3 in our example (about 11 million observations). This is in fact NOT due to the overhead of creating the journal, as this is negligible: PROC APPEND itself appears to use less CPU with a unique index than with no index at all. Adding more non-indexed variables into the example has the effect of moving the baseline for both techniques, but the convergence point remains the same.

In the same example, adding in the resources used by the creation step gives the following results.

[Figure: CPU utilisation comparison, create and append steps combined, for the journal and unique index methods.]

In this case the convergence point is after 1.7 years (6.2 million observations).
And finally, the I/O utilisation patterns.

[Figure: I/O utilisation comparison, create and append steps combined, for the journal and unique index methods over four years.]

In the case of I/O, the unique index consumed more I/O resources from day 1, and grew slowly over time. The growth rate in our example extrapolated to approximately 25% over three years.

Resources Used by Non-Unique Indexes

Note that I have not shown any performance comparisons for non-unique indexes, because they can be directly compared to one of the other two methods:

- Disk space: comparable to a unique index.
- I/O resources: comparable to a unique index.
- CPU resources: comparable to the unique index in the creation step, and to a journal index in the append step.

Conclusions

For the Amadeus Network Performance Database, the journal index method has distinct advantages for the following reasons:

- The number of variables in the database is small, so the overheads of an index would be high in percentage terms.
- The number of key variables required to make the index unique would be high.
- The database is most frequently used for regular reporting of all network resources within a given date period, so having the database indexed by date only is sufficient most of the time.
- For ad-hoc queries, the date is always used as a subsetting factor, and this makes the response time acceptable in most cases.
- Within a day's worth of data, the dataset is sorted by resource, allowing other optimisations within the query process (which I have not described here).

If, however, your data is not suited to this form of index, it should certainly be possible to achieve a low rate of growth for the update and query process by intelligent use of SAS indexes, or by direct access with the POINT= dataset option.
Coding Systems and Lookup Tables

Most of you have probably experienced some of the problems and frustrations that beset knowledge workers who are trying to maintain or query data going a long way back in time. One cause of this might be coding systems that have changed over time. As a designer of a data warehouse or historical database, it is wise to consider some additions to Murphy's law when applied to coding systems:

- Obsolete codes will be re-used to save introducing a new coding system.
- When the new coding system is introduced, no mapping will be possible to or from the old system.
- The date that they changed the coding system depends on whom you ask.
- The new codes look the same as the old codes; they just mean something different.

Some of these problems are not within the scope of this paper, and possibly not even within the capabilities of the author to answer. However, some can be alleviated by using date sensitive (or even time sensitive) lookups. This can be achieved using formats.

Example: Mapping Network Resources to Customers

Within Amadeus, network resources are frequently reviewed and modified to provide optimum bandwidth to a customer. In order to produce management reports showing the utilisation of the network by customer, we ideally need to map the line identifier to the customer for any 15 minute interval. Considering that we normally learn of changes after the event, if we were to store the customer as part of the database, we would have to re-process the daily data whenever a change was made. Therefore we maintain a table that looks as follows:

Resource Type  Resource ID  Effective Date  Customer
NPSI           X            JUL94:00:00     Lufthansa
NPSI           X            NOV94:15:15     Air France
NPSI           X            SEP94:00:45     Iberia
NPSI           X            NOV94:15:15     Lufthansa
NPSI           X            MAR95:12:00     SAS
NPSI           X            NOV94:15:15     Iberia
NPSI           X            OCT94:18:00     Air France
NPSI           X            NOV94:15:15     SAS

This table is then processed to create a dataset for processing by PROC FORMAT using the CNTLIN option.
In this example, a character format called NETUSER is created.
data formats;
  length start end $40 label $40;
  keep fmtname type start end label eexcl sexcl hlo;
  set data.netspeed end=last;
  by restype resid;

  /* Create enddate from next observation */
  if last.resid then enddate=.;
  else do;
    next=_n_+1;
    set data.netspeed(keep=effdate rename=(effdate=enddate)) point=next;
  end;

  /* Join the code and the date to make the lookup */
  start=restype||resid||put(effdate,z12.);
  end=restype||resid||put(enddate,z12.);

  /* Other variables used by PROC FORMAT */
  fmtname='netuser';
  type='c';
  eexcl='y'; sexcl='n';   /* Include start, exclude end */
  label=user;
  output;

  /* Default label */
  if last then do;
    start='**other**';
    end='**other**';
    hlo='o';
    label=' ';
    output;
  end;
return;

Once the format is compiled, the statement to find the customer based on the resource and the datetime becomes very simple:

user=put(restype||resid||put(datetime,z12.),$netuser.);

One important thing to note is that the codes (in our case RESTYPE and RESID) should have the same format and length in the lookup table and in the database.

Advantages

By making your lookup tables date sensitive, it is possible to avoid many of the problems of changing coding systems. At the design stage, it is safest to assume that all codes and coding systems may change within the lifetime of your data warehouse. Of course, the length of the code may also change, so it is wise to leave some free space within the variable to account for this.

If (or when!) codes change, simply add these codes into your lookup table with a new effective date, run the format compile as shown above, and the change is reflected automatically from the new date, with no change necessary in any programs.
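The rule the NETUSER format encodes - take the mapping whose effective date is the latest one not after the query time - can be pictured outside SAS as well. The following Python fragment is a sketch only, not the paper's PROC FORMAT code; the resource name "X25.1" and the helper names are invented for the illustration.

```python
# Illustrative sketch only (Python, not SAS): a date-sensitive lookup -
# each (resource, effective time) maps to a customer, and a query finds
# the mapping in force at a given time.

import bisect

def build_lookup(table):
    """table: list of (resource, effective_time, customer), any order."""
    lookup = {}
    for res, eff, cust in sorted(table):
        effs, custs = lookup.setdefault(res, ([], []))
        effs.append(eff)
        custs.append(cust)
    return lookup

def customer_at(lookup, resource, t):
    """Return the customer effective at time t, or '**other**' if none."""
    if resource not in lookup:
        return "**other**"
    effs, custs = lookup[resource]
    i = bisect.bisect_right(effs, t) - 1   # last effective time <= t
    return custs[i] if i >= 0 else "**other**"

table = [("X25.1", 100, "Lufthansa"), ("X25.1", 200, "Air France")]
lk = build_lookup(table)
assert customer_at(lk, "X25.1", 150) == "Lufthansa"
assert customer_at(lk, "X25.1", 250) == "Air France"
assert customer_at(lk, "X25.1", 50) == "**other**"
```

As with the format, adding a new effective-dated row changes the answer only from that date onward, with no change to the query code.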
Summary

As a data warehouse or performance database designer two years ago, you were probably restricted in disk space to about two years' worth of data. Now it appears that the demand for historical information is growing at the same pace as the data itself. As designers today, therefore, it is our responsibility to ensure that the processes for updating and retrieving the information will not be the limiting factor in supplying this demand. In short, we have to ensure that our data warehouses will stand the test of time.

SAS is a registered trademark of SAS Institute Inc., Cary, NC, USA

Acknowledgements: The idea for the journal index was inspired by a conversation with my colleague Sean Chaffee.
More informationThe Google File System
The Google File System Sanjay Ghemawat, Howard Gobioff and Shun Tak Leung Google* Shivesh Kumar Sharma fl4164@wayne.edu Fall 2015 004395771 Overview Google file system is a scalable distributed file system
More informationHP Dynamic Deduplication achieving a 50:1 ratio
HP Dynamic Deduplication achieving a 50:1 ratio Table of contents Introduction... 2 Data deduplication the hottest topic in data protection... 2 The benefits of data deduplication... 2 How does data deduplication
More informationHow TokuDB Fractal TreeTM. Indexes Work. Bradley C. Kuszmaul. Guest Lecture in MIT Performance Engineering, 18 November 2010.
6.172 How Fractal Trees Work 1 How TokuDB Fractal TreeTM Indexes Work Bradley C. Kuszmaul Guest Lecture in MIT 6.172 Performance Engineering, 18 November 2010. 6.172 How Fractal Trees Work 2 I m an MIT
More informationBinary Encoded Attribute-Pairing Technique for Database Compression
Binary Encoded Attribute-Pairing Technique for Database Compression Akanksha Baid and Swetha Krishnan Computer Sciences Department University of Wisconsin, Madison baid,swetha@cs.wisc.edu Abstract Data
More information6. Results. This section describes the performance that was achieved using the RAMA file system.
6. Results This section describes the performance that was achieved using the RAMA file system. The resulting numbers represent actual file data bytes transferred to/from server disks per second, excluding
More informationPhysical Level of Databases: B+-Trees
Physical Level of Databases: B+-Trees Adnan YAZICI Computer Engineering Department METU (Fall 2005) 1 B + -Tree Index Files l Disadvantage of indexed-sequential files: performance degrades as file grows,
More informationPaper # Jazz it up a Little with Formats. Brian Bee, The Knowledge Warehouse Ltd
Paper #1495-2014 Jazz it up a Little with Formats Brian Bee, The Knowledge Warehouse Ltd Abstract Formats are an often under-valued tool in the SAS toolbox. They can be used in just about all domains to
More informationDatabase System Concepts
Chapter 13: Query Processing s Departamento de Engenharia Informática Instituto Superior Técnico 1 st Semester 2008/2009 Slides (fortemente) baseados nos slides oficiais do livro c Silberschatz, Korth
More informationCreate a Format from a SAS Data Set Ruth Marisol Rivera, i3 Statprobe, Mexico City, Mexico
PharmaSUG 2011 - Paper TT02 Create a Format from a SAS Data Set Ruth Marisol Rivera, i3 Statprobe, Mexico City, Mexico ABSTRACT Many times we have to apply formats and it could be hard to create them specially
More informationHash-Based Indexing 165
Hash-Based Indexing 165 h 1 h 0 h 1 h 0 Next = 0 000 00 64 32 8 16 000 00 64 32 8 16 A 001 01 9 25 41 73 001 01 9 25 41 73 B 010 10 10 18 34 66 010 10 10 18 34 66 C Next = 3 011 11 11 19 D 011 11 11 19
More informationAll About SAS Dates. Marje Fecht Senior Partner, Prowerk Consulting. Copyright 2017 Prowerk Consulting
All About SAS Dates Marje Fecht Senior Partner, Prowerk Consulting Copyright 2017 Prowerk Consulting 1 SAS Dates What IS a SAS Date? And Why?? My data aren t stored as SAS Dates How can I convert How can
More informationTOP 10 (OR MORE) WAYS TO OPTIMIZE YOUR SAS CODE
TOP 10 (OR MORE) WAYS TO OPTIMIZE YOUR SAS CODE Handy Tips for the Savvy Programmer SAS PROGRAMMING BEST PRACTICES Create Readable Code Basic Coding Recommendations» Efficiently choosing data for processing»
More informationChapter 8 & Chapter 9 Main Memory & Virtual Memory
Chapter 8 & Chapter 9 Main Memory & Virtual Memory 1. Various ways of organizing memory hardware. 2. Memory-management techniques: 1. Paging 2. Segmentation. Introduction Memory consists of a large array
More informationBatch Jobs Performance Testing
Batch Jobs Performance Testing October 20, 2012 Author Rajesh Kurapati Introduction Batch Job A batch job is a scheduled program that runs without user intervention. Corporations use batch jobs to automate
More informationDDS Dynamic Search Trees
DDS Dynamic Search Trees 1 Data structures l A data structure models some abstract object. It implements a number of operations on this object, which usually can be classified into l creation and deletion
More informationData Structure. IBPS SO (IT- Officer) Exam 2017
Data Structure IBPS SO (IT- Officer) Exam 2017 Data Structure: In computer science, a data structure is a way of storing and organizing data in a computer s memory so that it can be used efficiently. Data
More informationUniversity of Waterloo Midterm Examination Solution
University of Waterloo Midterm Examination Solution Winter, 2011 1. (6 total marks) The diagram below shows an extensible hash table with four hash buckets. Each number x in the buckets represents an entry
More informationTackling Unique Problems Using TWO SET Statements in ONE DATA Step. Ben Cochran, The Bedford Group, Raleigh, NC
MWSUG 2017 - Paper BB114 Tackling Unique Problems Using TWO SET Statements in ONE DATA Step Ben Cochran, The Bedford Group, Raleigh, NC ABSTRACT This paper illustrates solving many problems by creatively
More informationShort Note. The unwritten computing rules at SEP. Alexander M. Popovici, Dave Nichols and Dimitri Bevc 1 INTRODUCTION
Stanford Exploration Project, Report 80, May 15, 2001, pages 1?? Short Note The unwritten computing rules at SEP Alexander M. Popovici, Dave Nichols and Dimitri Bevc 1 INTRODUCTION This short note is intended
More informationBig Data Analytics CSCI 4030
High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Queries on streams
More informationInformation Lifecycle Management for Business Data. An Oracle White Paper September 2005
Information Lifecycle Management for Business Data An Oracle White Paper September 2005 Information Lifecycle Management for Business Data Introduction... 3 Regulatory Requirements... 3 What is ILM?...
More informationChecking for Duplicates Wendi L. Wright
Checking for Duplicates Wendi L. Wright ABSTRACT This introductory level paper demonstrates a quick way to find duplicates in a dataset (with both simple and complex keys). It discusses what to do when
More informationData Vault Partitioning Strategies WHITE PAPER
Dani Schnider Data Vault ing Strategies WHITE PAPER Page 1 of 18 www.trivadis.com Date 09.02.2018 CONTENTS 1 Introduction... 3 2 Data Vault Modeling... 4 2.1 What is Data Vault Modeling? 4 2.2 Hubs, Links
More information. NO MORE MERGE - Alternative Table Lookup Techniques Dana Rafiee, Destiny Corporation/DDISC Group Ltd. U.S., Wethersfield, CT
betfomilw tltlljri4ls. NO MORE MERGE - Alternative Table Lookup Techniques Dana Rafiee, Destiny Corporation/DDISC Group Ltd. U.S., Wethersfield, CT ABSTRACT This tutorial is designed to show you several
More informationFrom Manual to Automatic with Overdrive - Using SAS to Automate Report Generation Faron Kincheloe, Baylor University, Waco, TX
Paper 152-27 From Manual to Automatic with Overdrive - Using SAS to Automate Report Generation Faron Kincheloe, Baylor University, Waco, TX ABSTRACT This paper is a case study of how SAS products were
More informationECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation
ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation Weiping Liao, Saengrawee (Anne) Pratoomtong, and Chuan Zhang Abstract Binary translation is an important component for translating
More informationHow to Incorporate Old SAS Data into a New DATA Step, or What is S-M-U?
How to Incorporate Old SAS Data into a New DATA Step, or What is S-M-U? Andrew T. Kuligowski Nielsen Media Research Abstract / Introduction S-M-U. Some people will see these three letters and immediately
More informationSAS File Management. Improving Performance CHAPTER 37
519 CHAPTER 37 SAS File Management Improving Performance 519 Moving SAS Files Between Operating Environments 520 Converting SAS Files 520 Repairing Damaged Files 520 Recovering SAS Data Files 521 Recovering
More informationBase and Advance SAS
Base and Advance SAS BASE SAS INTRODUCTION An Overview of the SAS System SAS Tasks Output produced by the SAS System SAS Tools (SAS Program - Data step and Proc step) A sample SAS program Exploring SAS
More informationIntro to DB CHAPTER 12 INDEXING & HASHING
Intro to DB CHAPTER 12 INDEXING & HASHING Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B+-Tree Index Files B-Tree Index Files Static Hashing Dynamic Hashing Comparison of Ordered Indexing
More informationData Warehousing. New Features in SAS/Warehouse Administrator Ken Wright, SAS Institute Inc., Cary, NC. Paper
Paper 114-25 New Features in SAS/Warehouse Administrator Ken Wright, SAS Institute Inc., Cary, NC ABSTRACT SAS/Warehouse Administrator 2.0 introduces several powerful new features to assist in your data
More informationChapter 12: Query Processing
Chapter 12: Query Processing Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Overview Chapter 12: Query Processing Measures of Query Cost Selection Operation Sorting Join
More informationCSE 530A. B+ Trees. Washington University Fall 2013
CSE 530A B+ Trees Washington University Fall 2013 B Trees A B tree is an ordered (non-binary) tree where the internal nodes can have a varying number of child nodes (within some range) B Trees When a key
More informationCS5412 CLOUD COMPUTING: PRELIM EXAM Open book, open notes. 90 minutes plus 45 minutes grace period, hence 2h 15m maximum working time.
CS5412 CLOUD COMPUTING: PRELIM EXAM Open book, open notes. 90 minutes plus 45 minutes grace period, hence 2h 15m maximum working time. SOLUTION SET In class we often used smart highway (SH) systems as
More informationVersion 8 Base SAS Performance: How Does It Stack-Up? Robert Ray, SAS Institute Inc, Cary, NC
Paper 9-25 Version 8 Base SAS Performance: How Does It Stack-Up? Robert Ray, SAS Institute Inc, Cary, NC ABSTRACT This paper presents the results of a study conducted at SAS Institute Inc to compare the
More informationClustering and Reclustering HEP Data in Object Databases
Clustering and Reclustering HEP Data in Object Databases Koen Holtman CERN EP division CH - Geneva 3, Switzerland We formulate principles for the clustering of data, applicable to both sequential HEP applications
More informationLecture 13. Lecture 13: B+ Tree
Lecture 13 Lecture 13: B+ Tree Lecture 13 Announcements 1. Project Part 2 extension till Friday 2. Project Part 3: B+ Tree coming out Friday 3. Poll for Nov 22nd 4. Exam Pickup: If you have questions,
More informationSummarizing Impossibly Large SAS Data Sets For the Data Warehouse Server Using Horizontal Summarization
Summarizing Impossibly Large SAS Data Sets For the Data Warehouse Server Using Horizontal Summarization Michael A. Raithel, Raithel Consulting Services Abstract Data warehouse applications thrive on pre-summarized
More informationBest Practice for Creation and Maintenance of a SAS Infrastructure
Paper 2501-2015 Best Practice for Creation and Maintenance of a SAS Infrastructure Paul Thomas, ASUP Ltd. ABSTRACT The advantage of using metadata to control and maintain data and access to data on databases,
More informationCAPACITY PLANNING FOR THE DATA WAREHOUSE BY W. H. Inmon
CAPACITY PLANNING FOR THE DATA WAREHOUSE BY W. H. Inmon The data warehouse environment - like all other computer environments - requires hardware resources. Given the volume of data and the type of processing
More informationGuide Users along Information Pathways and Surf through the Data
Guide Users along Information Pathways and Surf through the Data Stephen Overton, Overton Technologies, LLC, Raleigh, NC ABSTRACT Business information can be consumed many ways using the SAS Enterprise
More informationVirtual Memory - Overview. Programmers View. Virtual Physical. Virtual Physical. Program has its own virtual memory space.
Virtual Memory - Overview Programmers View Process runs in virtual (logical) space may be larger than physical. Paging can implement virtual. Which pages to have in? How much to allow each process? Program
More informationIf You Need These OBS and These VARS, Then Drop IF, and Keep WHERE Jay Iyengar, Data Systems Consultants LLC
Paper 2417-2018 If You Need These OBS and These VARS, Then Drop IF, and Keep WHERE Jay Iyengar, Data Systems Consultants LLC ABSTRACT Reading data effectively in the DATA step requires knowing the implications
More informationAre Your SAS Programs Running You? Marje Fecht, Prowerk Consulting, Cape Coral, FL Larry Stewart, SAS Institute Inc., Cary, NC
Paper CS-044 Are Your SAS Programs Running You? Marje Fecht, Prowerk Consulting, Cape Coral, FL Larry Stewart, SAS Institute Inc., Cary, NC ABSTRACT Most programs are written on a tight schedule, using
More informationHandling Numeric Representation SAS Errors Caused by Simple Floating-Point Arithmetic Computation Fuad J. Foty, U.S. Census Bureau, Washington, DC
Paper BB-206 Handling Numeric Representation SAS Errors Caused by Simple Floating-Point Arithmetic Computation Fuad J. Foty, U.S. Census Bureau, Washington, DC ABSTRACT Every SAS programmer knows that
More informationIndexing: Overview & Hashing. CS 377: Database Systems
Indexing: Overview & Hashing CS 377: Database Systems Recap: Data Storage Data items Records Memory DBMS Blocks blocks Files Different ways to organize files for better performance Disk Motivation for
More informationStephen M. Beatrous, SAS Institute Inc., Cary, NC John T. Stokes, SAS Institute Inc., Austin, TX
1/0 Performance Improvements in Release 6.07 of the SAS System under MVS, ems, and VMS' Stephen M. Beatrous, SAS Institute Inc., Cary, NC John T. Stokes, SAS Institute Inc., Austin, TX INTRODUCTION The
More informationAre you Still Afraid of Using Arrays? Let s Explore their Advantages
Paper CT07 Are you Still Afraid of Using Arrays? Let s Explore their Advantages Vladyslav Khudov, Experis Clinical, Kharkiv, Ukraine ABSTRACT At first glance, arrays in SAS seem to be a complicated and
More informationTable Lookups: From IF-THEN to Key-Indexing
Table Lookups: From IF-THEN to Key-Indexing Arthur L. Carpenter, California Occidental Consultants ABSTRACT One of the more commonly needed operations within SAS programming is to determine the value of
More informationFile Management By : Kaushik Vaghani
File Management By : Kaushik Vaghani File Concept Access Methods File Types File Operations Directory Structure File-System Structure File Management Directory Implementation (Linear List, Hash Table)
More information