Know What You Are Missing: How to Catalogue and Manage Missing Pieces of Historical Data

Shankar Yaddanapudi, SAS Consultant, Washington DC

ABSTRACT

In certain applications it is necessary to maintain and update historical data on a periodic basis. However, the data might be missing in one or more chunks, resulting in gaps in the history. This can adversely affect the performance of the application consuming the historical data, and various techniques are typically used to fill in these missing chunks. Before such methods are employed, however, the number and size of the missing chunks need to be assessed to decide the appropriate course of action. This is important because the best course of action when a few missing values are scattered throughout the data might differ from the best course when a large number of missing values are concentrated in a few spots. This situation is common in Risk Management applications. In this paper a situation is presented where it is critical to maintain a historical database without any missing values. SAS coding techniques are then presented to assess the data and make an inventory of it, including the number and size of the gaps. Possible courses of action based on this information are also briefly discussed.

INTRODUCTION

The problem of dealing with missing data has been studied extensively from several viewpoints. Much of this work has traditionally focused on imputation techniques, ranging from simple methods to sophisticated ones such as those encapsulated in PROC MI. Numerous papers can be found in SUGI conference proceedings about how to count and tabulate missing data as a percentage of total data, as well as how to impute the missing values. In this paper we present a situation where it would be helpful to examine the missing data more deeply than is usually done.
SAS code is then presented to help with this analysis, followed by a brief discussion of various options available to deal with the missing data.

A typical Risk Management System (RMS) consumes significant amounts of input data, and the flagship SAS product, SAS Risk Dimensions, is no exception. A major input to such an RMS, when it is used for market risk, is market data in the form of prices of various bonds, stocks, derivatives and other financial instruments. This data is typically used by the RMS to compute risk metrics such as VaR (Value-at-Risk), using a variety of methods such as historical simulation and Monte Carlo based methods. While there is some flexibility in choosing the particular risk computation method based on the objectives, it is generally agreed that the input market data should be of high quality, i.e., enough historical data should be available, and it should be accurate and free of missing values. Financial firms go to great lengths to secure market data, Bloomberg and Reuters being two major vendors of such data. However, in spite of best efforts, the input market data sometimes does have missing values, in one or more chunks spread across the time series. There could be several reasons for this; for example, certain low-volume equity options might not have been traded on several business days at a stretch. The data analyst needs to know why the data is missing in order to come up with good solutions that suit his or her objectives. A first step in studying the missing data would be simply to tabulate the number and size of the gaps in the data, and to look for patterns.

SUMMARIZING THE MISSING DATA

The missing data can be summarized in several ways, but a simple DATA step method will be presented here. Example data is presented in Fig 1.0, which lists three fictitious stock symbols with their closing prices. Stock AAXX has three gaps, of sizes 2, 1 and 3, while BBCD has two gaps, of sizes 6 and 1. CCCD has no gaps in the data. The SAS code in Fig 2 shows the approach taken to count the number and size of gaps.
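
To try the code in Fig 2 on the example data, the values listed in Fig 1.0 first need to be loaded into a data set. The DATA step below is one possible way to do this, not part of the original paper's code; it assumes the data set name prices and the variable names ID, date and price that Fig 2 expects, with a period marking each missing price.

```sas
/* Sketch: load the Fig 1.0 example data into the data set PRICES.
   The names PRICES, ID, date and price are taken from Fig 2.
   A period (.) is read by list input as a missing numeric value. */
DATA prices;
  INFILE DATALINES TRUNCOVER;
  INPUT ID :$4. date :MMDDYY10. price;
  FORMAT date MMDDYY10.;
  DATALINES;
AAXX 07/26/2010 35.82
AAXX 07/27/2010 .
AAXX 07/28/2010 .
AAXX 07/29/2010 33.73
AAXX 07/30/2010 35.94
AAXX 08/02/2010 .
AAXX 08/03/2010 35.81
AAXX 08/04/2010 37.38
AAXX 08/05/2010 .
AAXX 08/06/2010 .
AAXX 08/09/2010 .
AAXX 08/10/2010 36.39
AAXX 08/11/2010 36.44
AAXX 08/12/2010 35.76
BBCD 07/26/2010 .
BBCD 07/27/2010 .
BBCD 07/28/2010 .
BBCD 07/29/2010 .
BBCD 07/30/2010 .
BBCD 08/02/2010 .
BBCD 08/03/2010 78.22
BBCD 08/04/2010 .
BBCD 08/05/2010 77.41
CCCD 08/10/2010 77.16
CCCD 08/11/2010 77.23
CCCD 08/12/2010 77.76
;
RUN;
```

Note that the approach counts gaps among the rows that are present with a missing price; dates that are absent from the data set altogether (such as weekends here) are not treated as gaps.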

Fig 2. Sample Code to Extract Size and Number of Gaps in a Data Set

    /* Sort the input data set */
    PROC SORT DATA=prices;
      BY ID date;   ** ID is stock symbol;
    RUN;

    /* count the number and size of the gaps */
    DATA missgaps;
      SET prices;
      BY ID date;
      RETAIN ngaps size flag;
      ** ngaps is the number of gaps, size is the gap size, and flag indicates a missing value;

      /* reset the counters at the start of each security */
      IF FIRST.ID THEN DO;
        ngaps=0; size=0; flag=0;
      END;

      /* identify missing data */
      IF MISSING(price) EQ 1 THEN gap=1;
      ELSE gap=0;

      /* start counting from the first missing data point */
      IF gap EQ 1 THEN DO;
        flag=1;
        size+1;
      END;

      /* when a non-missing value is encountered, increment gap number and reset */
      /* a gap still open on the last row of an ID is closed by the FIRST.ID reset, */
      /* so that row keeps its gap number and running size */
      IF gap EQ 0 AND flag EQ 1 THEN DO;
        flag=0;
        ngaps+1;
        size=0;
      END;
    RUN;

    /* extract the gaps and their sizes, and their starting and ending dates */
    PROC SQL NOPRINT;
      CREATE TABLE misssummary AS
      SELECT ID,
             1+MAX(ngaps) AS GapNumber,
             MAX(size)    AS GapSize,
             MIN(date)    AS StartDate FORMAT=mmddyy10.,
             MAX(date)    AS EndDate   FORMAT=mmddyy10.
      FROM missgaps
      WHERE gap GT 0
      GROUP BY ID, ngaps
      ORDER BY ID, GapNumber;
    QUIT;

The input data set is first sorted by the variable ID, which is the stock symbol, and by date. In the DATA step that follows, three variables are retained across observations with a RETAIN statement and set to zero at the start of each ID: ngaps is the number of gaps, size is the gap size, and flag indicates a missing value. The MISSING function is then used to identify the missing values, and the flag indicator is set accordingly. The variable size is incremented with each successive missing value. When the run of missing values ends, either because a non-missing value is encountered or because data for a new security begins, flag and size are reset to zero. Finally, a PROC SQL query extracts the number of gaps and their sizes.

Fig. 3.0 Listing of the data set missgaps

    Obs  Date        ID    Price  ngaps  size  flag  gap
      1  07/26/2010  AAXX  35.82      0     0     0    0
      2  07/27/2010  AAXX      .      0     1     1    1
      3  07/28/2010  AAXX      .      0     2     1    1
      4  07/29/2010  AAXX  33.73      1     0     0    0
      5  07/30/2010  AAXX  35.94      1     0     0    0
      6  08/02/2010  AAXX      .      1     1     1    1
      7  08/03/2010  AAXX  35.81      2     0     0    0
      8  08/04/2010  AAXX  37.38      2     0     0    0
      9  08/05/2010  AAXX      .      2     1     1    1
     10  08/06/2010  AAXX      .      2     2     1    1
     11  08/09/2010  AAXX      .      2     3     1    1
     12  08/10/2010  AAXX  36.39      3     0     0    0
     13  08/11/2010  AAXX  36.44      3     0     0    0
     14  08/12/2010  AAXX  35.76      3     0     0    0
     15  07/26/2010  BBCD      .      0     1     1    1
     16  07/27/2010  BBCD      .      0     2     1    1
     17  07/28/2010  BBCD      .      0     3     1    1
     18  07/29/2010  BBCD      .      0     4     1    1
     19  07/30/2010  BBCD      .      0     5     1    1
     20  08/02/2010  BBCD      .      0     6     1    1
     21  08/03/2010  BBCD  78.22      1     0     0    0
     22  08/04/2010  BBCD      .      1     1     1    1
     23  08/05/2010  BBCD  77.41      2     0     0    0
     24  08/10/2010  CCCD  77.16      0     0     0    0
     25  08/11/2010  CCCD  77.23      0     0     0    0
     26  08/12/2010  CCCD  77.76      0     0     0    0

The contents of the intermediate and the final data sets are presented in Figs 3 and 4. As seen in Fig. 4, for each security all the gaps and their sizes are listed, along with the dates showing the starting and ending points of each gap. If needed, the SQL query can be modified to extract the size of the maximum gap. If a security does not have any gaps, it is not listed.
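
As an illustration of the modification mentioned above, the query below is a minimal sketch of one way to extract the largest gap per security. It assumes the intermediate data set missgaps from Fig 2; the output table name maxgap and the column name MaxGapSize are hypothetical. Because size is a running count within each gap, its maximum over all gap rows of a security is the length of that security's longest gap.

```sas
/* Sketch: size of the largest gap for each security.
   MAXGAP and MaxGapSize are illustrative names, not from the paper. */
PROC SQL;
  CREATE TABLE maxgap AS
  SELECT ID,
         MAX(size) AS MaxGapSize
  FROM missgaps
  WHERE gap GT 0          /* keep only rows inside a gap */
  GROUP BY ID
  ORDER BY ID;
QUIT;
```

For the example data this should return 3 for AAXX and 6 for BBCD, in agreement with the size column of Fig. 3.0; CCCD, which has no gaps, does not appear.
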
The code presented in Fig 2 demonstrates the basic idea; it can be made more efficient and compact, and can easily be converted into a macro.

ANALYZING THE MISSING DATA

The question arises of how this summarized information can be of use to a data analyst. As an example, consider the case of Constant Maturity Treasury (CMT) rates, an important piece of market data used in a typical RMS. A listing of this data can be found at:
http://www.treas.gov/offices/domestic-finance/debt-management/interest-rate/yield.shtml
If this data is summarized as described above, one would notice a very large gap for the CMT 30 series during the period from February 18, 2002 to February 8, 2006. Further investigation would reveal that the data is missing because this CMT series was discontinued during this period, and that the Treasury has published alternate rates to this series. Armed with this information, the analyst can decide whether to use the alternate rates or some other method to fill in the gap, depending on his or her objectives.

Fig. 4.0 Listing of the data set misssummary

    ID    GapNumber  GapSize  StartDate   EndDate
    AAXX          1        2  07/27/2010  07/28/2010
    AAXX          2        1  08/02/2010  08/02/2010
    AAXX          3        3  08/05/2010  08/09/2010
    BBCD          1        6  07/26/2010  08/02/2010
    BBCD          2        1  08/04/2010  08/04/2010

On the other hand, if the number of gaps and the gap sizes are small, there might be other reasons why the data is missing. For example, the market data vendor's database might not have been updated, or the security in question might not have been traded on some days.

DEALING WITH THE MISSING DATA

There are several options for dealing with missing data, as elaborated in numerous papers. Some of the simplest methods include last observation carried forward (LOCF), substitution by the mean, and substitution by regression, while the multiple imputation methods available in PROC MI represent the state of the art. The choice of method depends on several factors, including the objectives of the analysis as well as time and computational constraints. In the case of market data, the data sets tend to have hundreds of risk factors (variables) to be managed. A typical RMS also consumes a significant amount of computational resources and can take several hours each day to gather and transform portfolio and market data, generate simulations for various market states, and compute risk measures. Some of these steps are time intensive, and there is pressure to minimize processing time in each step. The advanced methods available in PROC MI are not always feasible under these circumstances. However, due to the unique nature of the objectives of an RMS, some unconventional methods are available to analysts to fill in the missing data. The major objective of an RMS is to measure risk across a portfolio, as reflected by the market states.
To accomplish this, a covariance matrix is built which represents the correlations between all the risk factors. When some of these risk factors are missing, a security which represents either the entire market or the security in question can be used as a proxy. For example, the S&P 500 is a well-regarded market index which represents the overall market fairly well. In the case where the missing data is scattered across the series, the analyst might conclude that the more traditional methods, such as LOCF or substitution by regression, are more appropriate. These choices can easily be automated in a SAS program, which would first compute the number and size of gaps, and then use this information to implement various options to fill in the missing data.
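
The automation described above could be sketched as follows. This is only an illustration under stated assumptions, not the author's implementation: it assumes the data sets prices and misssummary from Fig 2, and it introduces a hypothetical macro variable maxlocf (the largest gap size that will be filled by LOCF) and hypothetical data set names prices_gap and prices_filled. Gaps larger than the threshold are left missing so that they can be handled separately, for example with a proxy series.

```sas
%LET maxlocf = 2;   /* hypothetical threshold: largest gap filled by LOCF */

/* Attach to every row the full size of the gap it falls in (if any) */
PROC SQL;
  CREATE TABLE prices_gap AS
  SELECT p.ID, p.date, p.price, s.GapSize
  FROM prices AS p
       LEFT JOIN misssummary AS s
       ON  p.ID = s.ID
       AND p.date BETWEEN s.StartDate AND s.EndDate
  ORDER BY p.ID, p.date;
QUIT;

/* Carry the last known price forward, but only across short gaps */
DATA prices_filled;
  SET prices_gap;
  BY ID date;
  RETAIN lastprice;
  IF FIRST.ID THEN lastprice = .;          /* never carry across securities */
  IF NOT MISSING(price) THEN lastprice = price;
  ELSE IF 1 <= GapSize <= &maxlocf THEN price = lastprice;
  DROP lastprice;
RUN;
```

With the example data and a threshold of 2, this would fill the AAXX gaps of sizes 2 and 1 and the BBCD gap of size 1, while the AAXX gap of size 3 and the BBCD gap of size 6 stay missing. A short gap at the very start of a series would also stay missing, since there is no earlier price to carry forward.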

CONCLUSIONS

In some situations it would be beneficial to consider the number and size of gaps in the missing data before filling in the data. This can easily be done using standard SAS techniques.

ACKNOWLEDGMENTS

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are registered trademarks or trademarks of their respective companies.

CONTACT INFORMATION

Please contact the author if you have any questions or comments:

Shankar Yaddanapudi
Paradigm Infotec
Columbia, MD 21045
Email: shankar.stat@gmail.com

Fig. 1.0 Example Input data

    ID    DATE        PRICE
    AAXX  07/26/2010  35.82
    AAXX  07/27/2010      .
    AAXX  07/28/2010      .
    AAXX  07/29/2010  33.73
    AAXX  07/30/2010  35.94
    AAXX  08/02/2010      .
    AAXX  08/03/2010  35.81
    AAXX  08/04/2010  37.38
    AAXX  08/05/2010      .
    AAXX  08/06/2010      .
    AAXX  08/09/2010      .
    AAXX  08/10/2010  36.39
    AAXX  08/11/2010  36.44
    AAXX  08/12/2010  35.76
    BBCD  07/26/2010      .
    BBCD  07/27/2010      .
    BBCD  07/28/2010      .
    BBCD  07/29/2010      .
    BBCD  07/30/2010      .
    BBCD  08/02/2010      .
    BBCD  08/03/2010  78.22
    BBCD  08/04/2010      .
    BBCD  08/05/2010  77.41
    CCCD  08/10/2010  77.16
    CCCD  08/11/2010  77.23
    CCCD  08/12/2010  77.76