Know What You Are Missing: How to Catalogue and Manage Missing Pieces of Historical Data

Shankar Yaddanapudi, SAS Consultant, Washington DC

ABSTRACT

In certain applications it is necessary to maintain and update historical data on a periodic basis. However, the data might be missing in one or more chunks, resulting in gaps in the history. These gaps can degrade the performance of the application consuming the historical data, and various techniques are typically used to fill them in. Before such methods are employed, however, the number and size of the missing chunks need to be assessed to decide the appropriate course of action. This is important because the best course of action when a few missing values are scattered throughout the data might differ from the best course when a large number of missing values are concentrated in a few spots. This situation is common in risk management applications. This paper presents a situation where it is critical to maintain a historical database without any missing values. SAS coding techniques are then presented to assess the data and make an inventory of it, including the number and size of the gaps. Possible courses of action based on this information are also briefly discussed.

INTRODUCTION

The problem of dealing with missing data has been studied extensively from several viewpoints. Much of this work has traditionally focused on imputation techniques, ranging from simple methods to sophisticated ones such as those encapsulated in PROC MI. Numerous SUGI conference papers describe how to count and tabulate missing data as a percentage of total data, as well as how to impute the missing values. In this paper we present a situation where it is helpful to examine the missing data in more depth than is usually done.
SAS code is then presented to help with this analysis, followed by a brief discussion of the options available for dealing with the missing data.

A typical risk management system (RMS) consumes significant amounts of input data, and the SAS flagship product, SAS Risk Dimensions, is no exception. A major input to such an RMS, when used to measure market risk, is market data in the form of prices of various bonds, stocks, derivatives and other financial instruments. The RMS typically uses this data to compute risk metrics such as VaR (Value-at-Risk), using a variety of methods such as historical simulation and Monte Carlo simulation. While there is some flexibility in choosing the risk computation method based on the objectives, it is generally agreed that the input market data should be of high quality, i.e., enough historical data should be available, and it should be accurate and free of missing values.

Financial firms go to great lengths to secure market data, Bloomberg and Reuters being two major vendors of such data. However, in spite of best efforts, the input market data sometimes does have missing values, in one or more chunks spread across the time series. There could be several reasons for this. For example, certain low-volume equity options might not have been traded for several business days at a stretch. The data analyst needs to know why the data is missing in order to come up with a solution that suits his or her objectives. A first step in studying the missing data is simply to tabulate the number and size of the gaps in the data and look for patterns.

SUMMARIZING THE MISSING DATA

The missing data can be summarized in several ways, but a simple DATA step method is presented here. Example data is presented in Fig 1.0, which lists three fictitious stock symbols with their closing prices. Stock AAXX has three gaps, of sizes 2, 1 and 3, while BBCD has two gaps, of sizes 6 and 1. CCCD has no gaps in the data. The SAS code in Fig 2 shows the approach taken to count the number and size of gaps.

Fig 2. Sample Code to Extract Size and Number of Gaps in a Data Set

/* Sort the input data set */
PROC SORT DATA=prices;
   BY ID date;   ** ID is stock symbol;
RUN;

/* Count the number and size of the gaps */
DATA missgaps;
   SET prices;
   BY ID date;
   ** ngaps is the number of gaps, size is the gap size, and flag indicates a missing value;
   RETAIN ngaps size flag;
   IF FIRST.ID THEN DO;
      ngaps=0; size=0; flag=0;
   END;
   /* identify missing data */
   IF MISSING(price) EQ 1 THEN gap=1;
   ELSE gap=0;
   /* start counting from the first missing data point */
   IF gap EQ 1 THEN DO;
      flag=1; size+1;
   END;
   /* when a non-missing value is encountered, increment gap number and reset. */
   /* A gap that runs to the last row of an ID needs no reset here: the         */
   /* FIRST.ID block resets the counters for the next ID, and resetting on the  */
   /* last missing row would zero its size before the summary below reads it.   */
   IF gap EQ 0 AND flag EQ 1 THEN DO;
      flag=0; ngaps+1; size=0;
   END;
RUN;

/* Extract the gaps and their sizes, and their starting and ending dates */
PROC SQL NOPRINT;
   CREATE TABLE misssummary AS
   SELECT ID,
          1+MAX(ngaps) AS GapNumber,
          MAX(size)    AS GapSize,
          MIN(date)    AS StartDate FORMAT=mmddyy10.,
          MAX(date)    AS EndDate   FORMAT=mmddyy10.
   FROM missgaps
   WHERE gap GT 0
   GROUP BY ID, ngaps
   ORDER BY ID, GapNumber;
QUIT;
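The two steps in Fig 2 could be wrapped in a macro so that they can be run against other data sets. The following is only a sketch of one way to do that. The macro name gapsummary, its parameter names, and the work data sets _sorted and _gaps are hypothetical and do not appear in the original code. The sketch resets the counters only when a non-missing value follows a gap, and relies on the FIRST. block to reset them at the start of each new ID.

```sas
/* Hypothetical macro wrapper around the Fig 2 logic.                     */
/* data= input data set, id= grouping variable, date= time variable,     */
/* var=  variable to check for gaps, out= name of the summary data set.  */
%MACRO gapsummary(data=, id=, date=, var=, out=);

   /* Sort a copy so that the input data set is left untouched */
   PROC SORT DATA=&data OUT=_sorted;
      BY &id &date;
   RUN;

   /* Number the gaps and count their sizes within each ID */
   DATA _gaps;
      SET _sorted;
      BY &id &date;
      RETAIN ngaps size flag;
      IF FIRST.&id THEN DO;
         ngaps=0; size=0; flag=0;
      END;
      gap = MISSING(&var);            /* 1 if missing, 0 otherwise */
      IF gap EQ 1 THEN DO;
         flag=1; size+1;
      END;
      ELSE IF flag EQ 1 THEN DO;      /* first non-missing value after a gap */
         flag=0; ngaps+1; size=0;
      END;
   RUN;

   /* One row per gap: its number, size, and first and last dates */
   PROC SQL NOPRINT;
      CREATE TABLE &out AS
      SELECT &id,
             1+MAX(ngaps) AS GapNumber,
             MAX(size)    AS GapSize,
             MIN(&date)   AS StartDate FORMAT=mmddyy10.,
             MAX(&date)   AS EndDate   FORMAT=mmddyy10.
      FROM _gaps
      WHERE gap GT 0
      GROUP BY &id, ngaps
      ORDER BY &id, GapNumber;
   QUIT;

%MEND gapsummary;

/* Example call using the names from Fig 2 */
%gapsummary(data=prices, id=ID, date=date, var=price, out=misssummary);
```

Passing the variable names as parameters lets the same logic be applied to any risk factor series, at the cost of two temporary data sets in the WORK library.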

The input data set is first sorted by the variable ID, which is the stock symbol, and then by date. In the following DATA step, three variables are retained across observations with a RETAIN statement and set to zero at the start of each ID: ngaps is the number of gaps, size is the gap size, and flag indicates a missing value. The MISSING function is then used to identify the missing values, and the flag indicator is set accordingly. The variable size is incremented with each successive missing value. When a run of missing values ends, either because a non-missing value is encountered or because data for a new security begins, flag and size are reset to zero. Finally, a PROC SQL query extracts the number of gaps and their sizes. The contents of the intermediate data set missgaps and the final data set misssummary are presented in Figs 3 and 4.

Fig. 3.0 Listing of the data set missgaps

Obs  Date        ID    Price  ngaps  size  flag  gap
  1  07/26/2010  AAXX  35.82      0     0     0    0
  2  07/27/2010  AAXX      .      0     1     1    1
  3  07/28/2010  AAXX      .      0     2     1    1
  4  07/29/2010  AAXX  33.73      1     0     0    0
  5  07/30/2010  AAXX  35.94      1     0     0    0
  6  08/02/2010  AAXX      .      1     1     1    1
  7  08/03/2010  AAXX  35.81      2     0     0    0
  8  08/04/2010  AAXX  37.38      2     0     0    0
  9  08/05/2010  AAXX      .      2     1     1    1
 10  08/06/2010  AAXX      .      2     2     1    1
 11  08/09/2010  AAXX      .      2     3     1    1
 12  08/10/2010  AAXX  36.39      3     0     0    0
 13  08/11/2010  AAXX  36.44      3     0     0    0
 14  08/12/2010  AAXX  35.76      3     0     0    0
 15  07/26/2010  BBCD      .      0     1     1    1
 16  07/27/2010  BBCD      .      0     2     1    1
 17  07/28/2010  BBCD      .      0     3     1    1
 18  07/29/2010  BBCD      .      0     4     1    1
 19  07/30/2010  BBCD      .      0     5     1    1
 20  08/02/2010  BBCD      .      0     6     1    1
 21  08/03/2010  BBCD  78.22      1     0     0    0
 22  08/04/2010  BBCD      .      1     1     1    1
 23  08/05/2010  BBCD  77.41      2     0     0    0
 24  08/10/2010  CCCD  77.16      0     0     0    0
 25  08/11/2010  CCCD  77.23      0     0     0    0
 26  08/12/2010  CCCD  77.76      0     0     0    0

As seen in Fig 4, all the gaps and their sizes are listed for each security, along with the starting and ending dates of each gap. If needed, the SQL query can be modified to extract the size of the maximum gap. If a security does not have any gaps, it is not listed.
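One possible way to extract the size of the maximum gap is to summarize the data set misssummary a second time, by ID. The query below is a minimal sketch of that idea. The output data set name maxgap and the column names NumGaps and MaxGapSize are hypothetical.

```sas
/* Summarize the per-gap listing to one row per ID.            */
/* Assumes misssummary has been created as in Fig 2.           */
PROC SQL;
   CREATE TABLE maxgap AS
   SELECT ID,
          COUNT(*)     AS NumGaps,      /* number of gaps for this ID */
          MAX(GapSize) AS MaxGapSize    /* size of the largest gap    */
   FROM misssummary
   GROUP BY ID
   ORDER BY ID;
QUIT;
```

For the listing in Fig 4 this would give three gaps with a maximum size of 3 for AAXX, and two gaps with a maximum size of 6 for BBCD. As with misssummary, a security without gaps would not appear in the result.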
The code presented in Fig 2 demonstrates the basic idea. It can be modified to be more efficient and compact, and can easily be converted into a macro.

ANALYZING THE MISSING DATA

How can this summarized information be of use to a data analyst? As an example, consider Constant Maturity Treasury (CMT) rates, an important piece of market data used in a typical RMS. A listing of this data can be found at:

http://www.treas.gov/offices/domestic-finance/debt-management/interest-rate/yield.shtml

If this data is summarized as described above, one would notice a very large gap in the 30-year CMT series, from February 18, 2002 to February 8, 2006. Further investigation would reveal that the data is missing because this CMT series was discontinued during this period, and that the Treasury has published alternate rates to this series. Armed with this information, the analyst can decide whether to use the alternate rates or some other method to fill in the gap, depending on his or her objectives.

On the other hand, if the number of gaps and the gap sizes are small, there might be other reasons why the data is missing. For example, the market data vendor's database might not have been updated, or the security in question might not have been traded on some days.

Fig. 4.0 Listing of the data set misssummary

ID    GapNumber  GapSize  StartDate   EndDate
AAXX          1        2  07/27/2010  07/28/2010
AAXX          2        1  08/02/2010  08/02/2010
AAXX          3        3  08/05/2010  08/09/2010
BBCD          1        6  07/26/2010  08/02/2010
BBCD          2        1  08/04/2010  08/04/2010

DEALING WITH THE MISSING DATA

There are several options for dealing with missing data, as elaborated in numerous papers. The simplest methods include last observation carried forward (LOCF), substitution by the mean, and substitution by regression, while the multiple imputation methods available in PROC MI represent the state of the art. The choice of method depends on several factors, including the objectives of the analysis as well as time and computational constraints.

In the case of market data, the data sets tend to have hundreds of risk factors (variables) to be managed. A typical RMS also consumes a significant amount of computational resources and can take several hours each day to gather and transform portfolio and market data, generate simulations for various market states, and compute risk measures. Some of these steps are time intensive, and there is pressure to minimize the processing time of each step. The advanced methods available in PROC MI are not always feasible under these circumstances.

However, because of the particular objectives of an RMS, some unconventional methods are available to analysts for filling in the missing data. The major objective of an RMS is to measure risk across a portfolio, as reflected by the market states.
To accomplish this, a covariance matrix is built which captures the correlations between all the risk factors. When data for some of these risk factors is missing, a security which represents either the entire market or the security in question can be used as a proxy. For example, the S&P 500 is a well-regarded market index which represents the overall market fairly well. In the case where the missing data is scattered throughout the series, the analyst might conclude that more traditional methods such as LOCF or substitution by regression are more appropriate. These choices can easily be automated in a SAS program, which would first compute the number and size of gaps, and then use this information to select among the options for filling in the missing data.
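As an illustration of how such a choice might be automated, the sketch below attaches each gap's size to its observations, fills short gaps by LOCF, and only flags longer gaps for proxy treatment. It assumes the data sets missgaps and misssummary from Fig 2. The threshold macro variable maxlocf, the data sets tagged and filled, and the variables lastprice and action are hypothetical, and the threshold value is arbitrary. The proxy substitution itself is not implemented here.

```sas
%LET maxlocf = 3;   /* hypothetical threshold: longest gap filled by LOCF */

/* Attach the size of the enclosing gap to every missing observation */
PROC SQL;
   CREATE TABLE tagged AS
   SELECT a.ID, a.date, a.price, b.GapSize
   FROM missgaps AS a
        LEFT JOIN misssummary AS b
        ON  a.ID = b.ID
        AND a.date BETWEEN b.StartDate AND b.EndDate
   ORDER BY a.ID, a.date;
QUIT;

/* Fill short gaps by LOCF; flag the rest for a proxy-based method */
DATA filled;
   SET tagged;
   BY ID date;
   RETAIN lastprice;
   LENGTH action $8;
   IF FIRST.ID THEN lastprice = .;        /* never carry a price across IDs */
   IF NOT MISSING(price) THEN DO;
      lastprice = price;                  /* remember the latest observed price */
      action = 'KEEP';
   END;
   ELSE IF GapSize LE &maxlocf AND NOT MISSING(lastprice) THEN DO;
      price = lastprice;                  /* short gap: carry last value forward */
      action = 'LOCF';
   END;
   ELSE action = 'PROXY';                 /* long or leading gap: left missing */
   DROP lastprice;
RUN;
```

With the example data and this threshold, the three AAXX gaps and the one-day BBCD gap would be filled by LOCF, while the six-day gap at the start of the BBCD series would be left missing and flagged, since it exceeds the threshold and has no earlier price to carry forward.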

CONCLUSIONS

In some situations it is beneficial to consider the number and size of the gaps in the missing data before filling in the data. This can easily be done using standard SAS techniques.

ACKNOWLEDGMENTS

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are registered trademarks or trademarks of their respective companies.

CONTACT INFORMATION

Please contact the author if you have any questions or comments:

Shankar Yaddanapudi
Paradigm Infotec
Columbia, MD 21045
Email: shankar.stat@gmail.com

Fig. 1.0 Example Input data

ID    DATE        PRICE      ID    DATE        PRICE
AAXX  07/26/2010  35.82      BBCD  07/26/2010      .
AAXX  07/27/2010      .      BBCD  07/27/2010      .
AAXX  07/28/2010      .      BBCD  07/28/2010      .
AAXX  07/29/2010  33.73      BBCD  07/29/2010      .
AAXX  07/30/2010  35.94      BBCD  07/30/2010      .
AAXX  08/02/2010      .      BBCD  08/02/2010      .
AAXX  08/03/2010  35.81      BBCD  08/03/2010  78.22
AAXX  08/04/2010  37.38      BBCD  08/04/2010      .
AAXX  08/05/2010      .      BBCD  08/05/2010  77.41
AAXX  08/06/2010      .      CCCD  08/10/2010  77.16
AAXX  08/09/2010      .      CCCD  08/11/2010  77.23
AAXX  08/10/2010  36.39      CCCD  08/12/2010  77.76
AAXX  08/11/2010  36.44
AAXX  08/12/2010  35.76
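For readers who wish to run the code in Fig 2, the example data in Fig 1.0 could be read into a data set named prices with a DATA step along the following lines. This is a sketch rather than part of the original program. It assumes list input with one observation per line, and a period for each missing price.

```sas
/* Build the example data set from Fig 1.0; a period marks a missing price */
DATA prices;
   INPUT ID $ date :mmddyy10. price;
   FORMAT date mmddyy10.;
   DATALINES;
AAXX 07/26/2010 35.82
AAXX 07/27/2010 .
AAXX 07/28/2010 .
AAXX 07/29/2010 33.73
AAXX 07/30/2010 35.94
AAXX 08/02/2010 .
AAXX 08/03/2010 35.81
AAXX 08/04/2010 37.38
AAXX 08/05/2010 .
AAXX 08/06/2010 .
AAXX 08/09/2010 .
AAXX 08/10/2010 36.39
AAXX 08/11/2010 36.44
AAXX 08/12/2010 35.76
BBCD 07/26/2010 .
BBCD 07/27/2010 .
BBCD 07/28/2010 .
BBCD 07/29/2010 .
BBCD 07/30/2010 .
BBCD 08/02/2010 .
BBCD 08/03/2010 78.22
BBCD 08/04/2010 .
BBCD 08/05/2010 77.41
CCCD 08/10/2010 77.16
CCCD 08/11/2010 77.23
CCCD 08/12/2010 77.76
;
RUN;
```

The colon modifier on the date informat lets list input read the date up to the next blank, so the columns do not need to be aligned.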