Combining Contiguous Events and Calculating Duration in Kaplan-Meier Analysis Using a Single Data Step

Hui Song, PRA International, Horsham, PA
George Laskaris, PRA International, Horsham, PA

ABSTRACT

Many studies require contiguous events (e.g., events one or two days apart) to be combined before survival analysis. Because individual events overlap in many different ways, it is challenging to (1) combine the contiguous events, (2) calculate their duration (and other time-to-event parameters), and (3) output only the combined events to a dataset. In this work, we present a clean, single-data-step approach that addresses all three challenges efficiently. A sample scenario is used to illustrate the process.

INTRODUCTION

Kaplan-Meier (KM) analysis, or survival analysis, represents a set of statistical methods that estimate lifetime or length of time between two events of a given event-of-interest (EOI). Survival data are often analyzed in terms of time-to-event parameters, such as time-to-first-event (or onset), time-to-resolution, and duration.

If there is at least one event as specified in a given EOI, the number of days to the earliest on-study event is the time-to-first-event. If there are no events for the given EOI, time-to-first-event is the time to the censor date per the statistical analysis plan (SAP). Duration is the difference between the (imputed) start date and the (imputed) end date. If an event does not have an imputed end (or start) date, it should be censored per SAP. For time-to-resolution, only those events that cross the last dose date (i.e., start before the last dose date and end after it) are of concern.

The calculation of time-to-event parameters is straightforward until contiguous events must be combined for KM analysis. Contiguous events are events that are a short time period apart (e.g., one or two days), as defined per SAP. Combining the contiguous events before KM analysis is required in many studies.
Given this requirement, the derivation of time-to-event parameters becomes challenging for the following reasons. First, a given EOI may have hundreds of AE events, and the individual events may overlap in many different ways. One needs to differentiate contiguous events from those that are not, so that the combination can be done correctly. Second, calculating the time-to-event parameters (such as duration) on the fly is difficult given all the different combinations of events. Third, the combined events must be output efficiently on the fly.

In this abstract, we present a clean, efficient, single-data-step approach that combines the contiguous events and calculates the time-to-event parameters on the fly. In particular, we show how to design and implement the proposed scheme with robustness and efficiency in mind.

DESCRIPTION OF METHOD

Several approaches can be used to address the issues mentioned above. The straightforward one is to go through all the events (for a given EOI) in multiple rounds. In each round, adjacent events that are contiguous are combined into one. The process continues until no more combination can be done. Some preliminary tests showed that, in order to combine all the events correctly for our study, it might take up to 30 rounds for the algorithm to converge. In this sense, the algorithm's running time is 30*n (where n is the total number of events). This simple solution becomes inefficient when the number of AE events increases to thousands or more.

A second approach is to combine the events in groups of two in the first round. In the next round, we combine pairs of the combined events from the previous round, and so on until the log(n)-th round, in which the algorithm converges. A simple illustration is shown below.

Round 1: (1 2) (3 4) (5 6) (7 8) ... (n-3 n-2) (n-1 n)
Round 2: (1 2 3 4) (5 6 7 8) ... (n-3 n-2 n-1 n)

The running time will be n. However, this approach is hard to implement because of how data are processed in the data step. (Due to space limitations, more detailed discussion and results are not included in this abstract.)

In this abstract, we present a method that has a linear running time of n, utilizing SAS features such as the RETAIN statement. In addition, it combines and outputs the combined contiguous events on the fly.

Two critical techniques are used here: the RETAIN statement and a look-ahead mechanism. The RETAIN statement is used to maintain some flags across the observations (during a data step), telling SAS whether to check for combination or to output an observation selectively at the end of the data step iteration. The look-ahead mechanism obtains the AE start and end date of the next event, so that the method can set the flags or update the parameters of the combined event (such as its start and end date) when necessary to facilitate the on-the-fly processing.

Here is a brief description of the look-ahead mechanism. The one we adapted (SASCOMMUNITY.ORG, 2011) can be described as follows.

 1  data ae1;
 2     set ae;
 3     by subjid;
 4     retain ;
 5     set ae (firstobs=2 keep=aestdti aeendti
 6             rename=(aestdti=next_aestdti aeendti=next_aeendti) )
 7         ae (obs=1 drop=_all_);
 8     next_aestdti=ifn(last.subjid, (.), next_aestdti);
 9     next_aeendti=ifn(last.subjid, (.), next_aeendti);
10  run;

In the DATA step above, there are two SET statements (Line 2 and Line 5), both reading from the same data set but at different observations. In each iteration of the DATA step, the first SET statement (Line 2) reads one observation. The second SET statement (Lines 5 to 7) has two input data sets. The first one (Line 5) reads the observation that follows the one read on Line 2, by specifying the data set option firstobs=2.
In addition, aestdti and aeendti are renamed to next_aestdti and next_aeendti, respectively. The next time the first SET statement is executed, it reads the next observation, and so on. In the last iteration, while the first SET statement (Line 2) reads the very last observation, no observation is left in the data set on Line 5. Note that a DATA step stops as soon as any SET statement attempts to read past the end of its input. As a result, without further provision, SAS would leave the data step without outputting the very last observation read on Line 2. The second data set on Line 7 supplies one extra read so that the very last observation is output as well.

The look-ahead mechanism makes it possible to decide ahead of time whether the current event must be combined with the next one. However, the variety of ways in which events overlap makes it hard to decide when a combination is complete (so that it can be output). If even one overlapping scenario is not considered, the combination may be wrong. In addition, since we intend to do everything on the fly in one single data step, the program will be hard to debug unless the algorithm is designed robustly, together with tracking flags for verifying correctness.

Because of these considerations, we follow a rigid algorithm design process to implement our method. The process includes four steps: problem statement, algorithm outline, algorithm design, and algorithm implementation. In the following, we describe each step in more detail, followed by a sample scenario.
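Before moving on, the look-ahead technique described above can be tried in isolation. The following is a small sketch on made-up data: the subject IDs and the study-day values are assumptions chosen for illustration, and plain numbers stand in for dates. It uses the same two-SET construction as the listing above, with CALL MISSING in place of the IFN calls.

```sas
/* Toy input, sorted by subjid. Values are illustrative study days,
   not data from the paper. */
data ae;
   input subjid $ aestdti aeendti;
   datalines;
S1 1 3
S1 4 6
S2 2 5
;
run;

data ae1;
   set ae;
   by subjid;

   /* The second SET reads one observation ahead of the first.
      The extra OBS=1 data set (with no variables) supplies one more
      read, so the step does not stop before the last record is output. */
   set ae (firstobs=2 keep=aestdti aeendti
           rename=(aestdti=next_aestdti aeendti=next_aeendti))
       ae (obs=1 drop=_all_);

   /* Do not look ahead across a subject boundary */
   if last.subjid then call missing(next_aestdti, next_aeendti);
run;

proc print data=ae1 noobs;
run;
```

In this sketch the first S1 record should carry next_aestdti=4 and next_aeendti=6, while the last record of each subject carries missing look-ahead values.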

1. PROBLEM STATEMENT

As mentioned above, the problem to solve is to effectively (1) combine the contiguous events, (2) calculate their duration (and other time-to-event parameters), and (3) output only the combined events to a dataset, all on the fly. The following two examples illustrate a subset of the event overlapping scenarios and the expected results.

Table 1 shows two events after combination. Event 1 consists of A + B + C, where the start date of B is less than two days from the end date of A (and similarly for C). Event 2 consists of D, which starts 2 or more days after the end date of C and is therefore a separate event.

Table 1. Example Scenario 1

Original Event   Combined Event
Event A          Event 1
Event B          Event 1
Event C          Event 1
Event D          Event 2

[Timeline of events A, B, C, and D across Study Days 1 to 9]

In Table 2, A and B should be combined into one event, Event 1, since they overlap. In addition, Event 1 is an unresolved event, with duration calculated from the minimum of (AE start date for event A, AE start date for event B) to the maximum of (censored AE end date for event A, AE end date for event B).

Table 2. Example Scenario 2

Original Event   Combined Event
Event A          Event 1
Event B          Event 1

[Timeline of events A and B across Study Days 1 to 9]

2. ALGORITHM OUTLINE

Fig. 1 shows the notations used in this abstract. Given the problem statement above, we sketch the algorithm as follows. It consists of five steps. The first three steps are data preparation, which should be done in a separate data step to adjust aestdti and aeendti per SAP. The sorted data set (referred to as ae_sorted below) is then fed to the algorithm described in the next subsection (algorithm design). The design, implementation, and testing of the last two steps are the focus of the rest of the abstract.
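The data preparation that precedes the combining step (imputing a missing start date to the first dose date, censoring a missing or post-cutoff end date, and sorting) could look like the sketch below. The variable names cutoffdt (data cutoff date), ontrtfl (1 = subject still on treatment), and unresfl (record has a censored end date) are hypothetical; the paper does not name them. The three input records are made up for illustration.

```sas
/* Made-up input; cutoffdt, ontrtfl are assumed variable names */
data ae;
   input subjid $ skdcdn ontrtfl
         (fdosdt ldosdt cutoffdt aestdti aeendti) (:date9.);
   format fdosdt ldosdt cutoffdt aestdti aeendti date9.;
   datalines;
S1 1 0 30JAN2007 22MAY2007 30JUN2007 06FEB2007 12FEB2007
S1 1 0 30JAN2007 22MAY2007 30JUN2007 . .
S1 0 0 30JAN2007 22MAY2007 30JUN2007 01MAR2007 05MAR2007
;
run;

data ae_prep;
   set ae;
   where skdcdn = 1;                     /* keep records flagged for the EOI */

   /* Missing start date: use the first dose date */
   if missing(aestdti) then aestdti = fdosdt;

   /* Missing or post-cutoff end date: censor it */
   if missing(aeendti) or aeendti > cutoffdt then do;
      if ontrtfl = 1 then aeendti = cutoffdt;
      else aeendti = ldosdt + 30;        /* 30 days after last dose */
      unresfl = 1;                       /* hypothetical censoring marker */
   end;
   else unresfl = 0;
run;

/* Sort so that adjacent records can be compared */
proc sort data=ae_prep out=ae_sorted;
   by subjid aestdti;
run;
```

The exact imputation and censoring rules belong to the SAP of a given study, so the conditions here should be read as placeholders for those rules.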

The Algorithm Outline

For all records flagged for an EOI category (e.g., ae.skdcdn=1), do the following:

1) If aestdti is missing, set aestdti to the first dose date.
2) If aeendti is missing or is after the data cutoff, censor aeendti to the data cutoff date if the subject is still on treatment; otherwise set it to 30 days after the last study drug administration date.
3) Sort the records by subjid, aestdti.
4) Increment the event line number by 1, set the event start date to the aestdti of the record, set the event end date to the aeendti, and set the event duration (kmdy) to the event end date minus the event start date plus 1.
5) If the next aestdti is less than 2 days after the event end date and the aeendti of that record is greater than the event end date, set the event end date to that aeendti and the event duration (kmdy) to that aeendti minus the event start date plus 1; if the next aestdti is greater than the event end date by 2 or more, return to Step 4.

3. ALGORITHM DESIGN

Given the outline of the algorithm above, we describe the last two steps in SAS pseudo code below. Fig. 1 presents the notations used in the discussion.

Fig. 1. Notations

skdcdn    = skin disorder code (1: yes 0: no)
ae_sorted = the sorted AE dataset
fdosdt    = first dose date
ldosdt    = last dose date
aestdti   = imputed ae start date
aeendti   = imputed ae end date
kmstdti   = start date of the combined ae
kmendti   = end date of the combined ae
kmdy      = event duration

In each data step iteration, we RETAIN four flags.

a) censfln: event status, i.e., whether the combined event is resolved or unresolved (0: resolved 1: unresolved). If it contains any unresolved events, the combined event is treated as unresolved. Otherwise, it is resolved.
b) fstdt: the start date of the first event among all events of a combined event. In other words, it is the minimum aestdti of all events within a combined one.
c) lstdt: the maximum end date (aeendti) of all the events of a combined event.
d) contfln: an important flag that tells SAS whether to try to combine the current event with the previous one. By default, it is set to 1. It is set to 0 if the combination stops at the current observation. This flag is also used to decide which observations are output. In this abstract, all observations with contfln=0 are output (as combined events). Its use becomes clearer in the algorithm pseudo code below.

We introduced two auxiliary variables, next_aestdti and next_aeendti, to make it easier to compare the start and end dates of two adjacent events in the data set.

The core algorithm is presented in three separate figures, Fig. 2, 3, and 4, because of page size limitations. The algorithm consists of four major components: the RETAIN statement, Case 1, Case 2, and the combined event output. The algorithm is written in SAS pseudo code and is largely self-explanatory. Here we summarize each component and discuss potential issues that need careful consideration.

Fig. 2 shows the first two components, which are simple. The RETAIN statement retains the four flags stated above as the data step goes through its iterations. CASE 1 handles the situation where a subject has only one event (or record). In such a case, contfln is set to zero since the next event should not be combined with the current one. The censfln is set to zero since this is a resolved event.

CASE 2, the most complex part, is presented in Fig. 3 and Fig. 4. It is divided into three conditions. Note that all observations (events) need to be checked for Condition 1. Conditions 2 and 3 are mutually exclusive. In other words, an observation fits either Condition 2 or Condition 3, but not both. In the pseudo code, kmendti is the end date of the combined AE and kmdy is the event duration.

Fig. 2 Algorithm Pseudo Code (part 1 of 3)

DATA STEP BEGIN;
  SET ae_sorted;                        *the sorted AE dataset;
  RETAIN censfln fstdt lstdt contfln;

  CASE 1: the subject has only one record
  if first.subjid and last.subjid then do;   *no event combination is needed;
    contfln=0; censfln=0;               * set flags;
  *end of Case 1;

Fig. 3 Algorithm Pseudo Code (part 2 of 3)

DATA STEP (continued);
  CASE 2: if the subject has more than one record

  Condition 1: first record of a subject, reset flags
  if first.subjid then do;
    contfln=1;                          *set continued flag to 1;
    fstdt=aestdti; lstdt=aeendti;       *initialization;
    censfln=0;                          *set resolved flag to 0 (resolved);

  Condition 2: check whether should be combined
  if contfln=1 and not last.subjid then do;
    a. should combine with next observation
    if (next_aestdti-kmendti<2 or next_aestdti<=lstdt+1) then do;
      update fstdt/lstdt; update kmstdti/kmendti;
      if aeendti<=next_aeendti then kmendti=next_aeendti;
      if next_aeendti>lstdt then lstdt=next_aeendti;
    b. should not combine with next observation
    else do;
      kmstdti=fstdt;
      if kmendti<lstdt then kmendti=lstdt;
      kmdy=kmendti-kmstdti+1;
      contfln=0; censfln=0;             * set flags: aestdti/aeendti --> fstdt/lstdt;
  *end of Condition 2;

Fig. 4 Algorithm Pseudo Code (part 3 of 3)

DATA STEP (continued);
  CASE 2: (continued)

  Condition 3: event combination may not be needed
  if contfln=0 or last.subjid then do;
    a. if contfln=1 and last.subjid then do;
      fstdt/lstdt --> kmstdti/kmendti;
      kmdy=kmendti-kmstdti+1;
      contfln=0; censfln=0;             *set flags;
      aestdti/aeendti --> fstdt/lstdt;
    b. if (not last.subjid and contfln=0) and
          (next_aestdti-kmendti<2 or next_aestdti <= lstdt+1) then do;   *should be combined;
      contfln=1;                        *set continued flag to 1;
      fstdt/lstdt --> aestdti/aeendti;
      censfln=0;                        *set resolved flag to 0 (resolved);
      kmstdti/kmendti --> fstdt/lstdt;
      if aeendti<=next_aeendti then kmendti=next_aeendti;
      if next_aeendti>lstdt then lstdt=next_aeendti;
    c. else do;
      kmstdti=fstdt;
      if kmendti<lstdt then kmendti=lstdt;
      kmdy=kmendti-kmstdti+1;
      contfln=0; censfln=0;             * set flags: aestdti/aeendti --> fstdt/lstdt;
  *end of Condition 3;
  *end of Case 2;

  OUTPUT THE COMBINED EVENTS
  if contfln=0;
DATA STEP END;

The three conditions for CASE 2 are listed below:

Condition 1: first record of a subject, reset flags
Condition 2: check whether should be combined
Condition 3: event combination may not be needed

Condition 1 handles the situation where the observation is the first event for a new subject. Since we have a new subject, all the flags need to be reset appropriately. contfln is set to 1; by default, we assume combination is needed. fstdt and lstdt keep the minimum aestdti and maximum aeendti seen so far. They are used to check whether event combination is necessary as the iterations proceed, and they are initialized to the aestdti and aeendti of the first event of a new subject. Finally, we set censfln to 0, assuming the event is resolved. Note that the first observation of a new subject should still be checked for Condition 2 or Condition 3.

Condition 2 describes the situation where the current observation should be combined with the previous event. We also check whether it should be combined with the next event. This check results in two branches in the program (Branch a and Branch b in Fig. 3).

In Branch a, we need to update fstdt, lstdt, kmstdti, and kmendti. The latter two (kmstdti and kmendti) are what we keep in the final output as the start and end date of the combined event. Thus, they need to hold the earliest start date and the latest end date of the contiguous events. The former two should be updated as well, so that the combination can be done correctly if combining is still needed for the next event (the one we look ahead to from the current observation).

In Branch b, we prepare the current observation for final output, since this is the last event of a series of contiguous events. The kmstdti and kmendti are set accordingly (kmstdti=fstdt; if kmendti<lstdt then kmendti=lstdt). kmdy, the duration, is calculated from kmstdti and kmendti. Then we set contfln to 0 so that the observation will be output at the end of the data step.
Finally, we reset the rest of the flags, as in CASE 1, except contfln. contfln is not reset to 1 because it serves two purposes. First, it signals whether we should combine with the previous event (note that contfln is retained). Second, it controls output: if it is 0, the observation is output; otherwise, it is not output, since more combination may be needed. Bear this in mind as we proceed to Condition 3, where the contfln flag must be checked and handled carefully.

Now let us look at Condition 3, which has three branches. Branch a is the case where the observation is the last one for a given subject. Thus, no further combination is possible. It is processed as in Branch b of Condition 2. Branch b is for the case where the event is the first event of a possible new series of contiguous events. The current value of contfln is zero because the previous event was the last observation of a contiguous series and contfln was set to zero for output. That is also why, in this branch, we need to reset contfln to one again. In some sense, this branch is equivalent to Condition 1 plus Branch a of Condition 2. Finally, Branch c handles the remaining situations, where the event should be set for output, as before.

The last component of the algorithm is the output. If the contfln flag is set to zero, the observation is output to the final combined event dataset. It is this flag (contfln) that makes it possible to combine and output combined events on the fly. It is also this flag that needs careful handling, as we have seen in CASE 2.

4. ALGORITHM IMPLEMENTATION

Our implementation was done in SAS 9.1.3. Nevertheless, the algorithm applies to any SAS version. Given the pseudo code above, the implementation of the algorithm in SAS is straightforward and will not be discussed further.
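For readers who want something to run, the following is a minimal sketch of the same idea (a look-ahead read plus retained variables in one data step). It is a simplified variant, not the authors' implementation: it tracks only the start date, end date, and duration of each combined event, omits censfln, and replaces contfln with two hypothetical variables, inevt (an event is currently open) and closefl (close and output now). The toy input uses plain study-day numbers chosen for illustration, and "contiguous" is taken to mean that the next start is less than 2 days after the running end date.

```sas
/* Toy input, already sorted by subjid and start day (assumed values) */
data ae_sorted;
   input subjid $ aestdti aeendti;
   datalines;
S1 1 2
S1 3 4
S1 4 6
S1 8 9
;
run;

data combined (keep=subjid evtno kmstdti kmendti kmdy);
   set ae_sorted;
   by subjid;
   retain evtno inevt kmstdti kmendti;

   /* Look-ahead: fetch the start day of the next record */
   set ae_sorted (firstobs=2 keep=aestdti rename=(aestdti=next_aestdti))
       ae_sorted (obs=1 drop=_all_);

   if first.subjid then do;               /* new subject: reset state */
      evtno = 0;
      inevt = 0;
   end;

   if inevt = 0 then do;                  /* open a new combined event */
      evtno   = evtno + 1;
      kmstdti = aestdti;
      kmendti = aeendti;
      inevt   = 1;
   end;
   else kmendti = max(kmendti, aeendti);  /* extend the open event */

   /* Close at the subject's last record, or when the next record
      starts 2 or more days after the running end date */
   if last.subjid then closefl = 1;
   else closefl = (next_aestdti - kmendti >= 2);

   if closefl then do;
      kmdy  = kmendti - kmstdti + 1;
      inevt = 0;
      output;                             /* only combined events are written */
   end;
run;

proc print data=combined noobs;
run;
```

On this toy input the step should write two combined events: days 1 to 6 (kmdy=6) and days 8 to 9 (kmdy=2). Because the input is sorted by start date, comparing the next start with the running maximum end date is enough to handle nested and partially overlapping events in this simplified setting.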

THE SAMPLE SCENARIO AND RESULTS

Table 3 shows the AE events for a given EOI (skin disorder) for one subject. We use this sample scenario to illustrate how our algorithm behaves.

Table 3. Sample AE Event Data for Skin Disorder

Event  SUBJID       FDOSDT     LDOSDT     AESTDTI    AEENDTI    SKDCDN
1      ABC-XYZ-001  30-Jan-07  22-May-07  6-Feb-07   12-Feb-07  1
2      ABC-XYZ-001  30-Jan-07  22-May-07  12-Feb-07  2-Apr-07   1
3      ABC-XYZ-001  30-Jan-07  22-May-07  19-Mar-07  16-Apr-07  1
4      ABC-XYZ-001  30-Jan-07  22-May-07  2-Apr-07   2-Apr-07   1
5      ABC-XYZ-001  30-Jan-07  22-May-07  2-Apr-07   16-Apr-07  1
6      ABC-XYZ-001  30-Jan-07  22-May-07  9-Apr-07   16-Apr-07  1
7      ABC-XYZ-001  30-Jan-07  22-May-07  29-Apr-07  7-May-07   1

Fig. 5 is an illustration of the events to be combined. As can be seen, the seven events should be combined into two events.

Fig. 5. AE Events in Timeline
[Timeline of events 1 to 7 against the dates 2/1, 2/6, 2/12, 3/1, 3/19, 4/2, 4/9, 4/16, 4/29, and 5/7]

Table 4 shows the results with all flag information kept for illustration (note that subjid is not displayed).

Table 4. Contiguous Event Combination Results

Event  KMSTDTI    KMENDTI    KMDY  CENSFLN  CONTFLN  NEXT_AESTDTI  NEXT_AEENDTI
1      2/6/2007   2/12/2007  7     0        1        12-Feb-07     2-Apr-07
2      2/12/2007  4/2/2007   50    0        1        19-Mar-07     16-Apr-07
3      3/19/2007  4/16/2007  29    0        1        2-Apr-07      2-Apr-07
4      4/2/2007   4/2/2007   1     0        1        2-Apr-07      16-Apr-07
5      4/2/2007   4/16/2007  15    0        1        9-Apr-07      16-Apr-07
6      2/6/2007   4/16/2007  70    0        0        29-Apr-07     7-May-07
7      4/29/2007  5/7/2007   9     0        0        .             .

According to our algorithm, only the last two rows (where contfln=0) are output.
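The two durations in the output rows of Table 4 can be checked directly with SAS date arithmetic, using duration = end date minus start date plus 1. This is only an arithmetic check on the table values, not part of the algorithm; the variable names kmdy1 and kmdy2 are introduced here for illustration.

```sas
data _null_;
   /* Combined event 1: 6-Feb-07 through 16-Apr-07 */
   kmdy1 = '16APR2007'd - '06FEB2007'd + 1;
   /* Combined event 2: 29-Apr-07 through 7-May-07 */
   kmdy2 = '07MAY2007'd - '29APR2007'd + 1;
   put kmdy1= kmdy2=;      /* log shows kmdy1=70 kmdy2=9 */
run;
```

Both values agree with the KMDY column for rows 6 and 7 of Table 4.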

CONCLUSIONS

In this abstract, we presented a one-data-step process that merges contiguous events, calculates duration on the fly, and outputs only the combined events. We showed a four-step approach to designing and implementing the algorithm in a robust way. In the algorithm, we used one time-to-event parameter, duration, to illustrate the idea. Other time-to-event parameters can also be included in the calculation when necessary (such as event-free days that should be subtracted from the duration). We also used a sample scenario to illustrate the algorithm. The algorithm has proven efficient and robust in a study that we completed successfully. Note that there are many ways to combine contiguous events; this abstract shows just one of them.

REFERENCES

SASCOMMUNITY.ORG, Look-Ahead and Look-Back, http://www.sascommunity.org/wiki/look-Ahead_and_Look-Back (accessed August 21, 2012)

ACKNOWLEDGMENTS

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are registered trademarks or trademarks of their respective companies.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the authors at:

Hui Song
PRA International Inc.
630 Dresher Road
Horsham, PA 19044
Work Phone: 215-444-8583
Email: SongHui@PRAintl.com

George Laskaris
PRA International Inc.
630 Dresher Road
Horsham, PA 19044
Work Phone: 215-444-8575
Email: LaskarisGeorge@PRAIntl.com