Redo Log Process Mining in Real Life: Data Challenges & Opportunities


E. González López de Murillas 1, G.E. Hoogendoorn 1, and H.A. Reijers 1,2

1 Department of Mathematics and Computer Science, Eindhoven University of Technology, Eindhoven, The Netherlands
2 Department of Computer Science, Vrije Universiteit Amsterdam, Amsterdam, The Netherlands
e.gonzalez@tue.nl, g.e.hoogendoorn@student.tue.nl, h.a.reijers@tue.nl

Abstract. Data extraction and preparation are the most time-consuming phases of any process mining project. Due to the variability of the sources of event data, they remain highly manual processes in most cases. Moreover, it is very difficult to obtain reliable event data in enterprise systems that are not process-aware. Some techniques, like redo log process mining, try to solve these issues by automating the process as much as possible, and by enabling event extraction in systems that are not process-aware. This paper presents the challenges faced by redo log and traditional process mining, comparing both approaches at the theoretical and practical levels. Finally, we demonstrate that the data obtained with redo log process mining in a real-life environment is at least as valid as that extracted by the traditional approach.

Key words: Process Mining, Databases, Redo Logs, Event Logs, Data Quality.

1 Introduction

Data extraction and preparation are among the first steps to take in any business intelligence or data analysis project. In many cases, up to 80% of the time and effort, and 50% of the cost, is spent during the data extraction and preparation phases [1]. This is due to the fact that the original sources of data come in great variety, differing in structure depending on the nature of the application or process under study. The standardization of this phase represents a challenge, given that a lot of domain knowledge is usually required to carry it out.
It is because of this that most of the work is done by hand, in an ad-hoc fashion, requiring many iterations in order to obtain the proper data in the right form. In process mining the situation is not much different. Studies have been carried out focusing on SAP [2, 3, 4], or on ERPs in general [5]. Also, efforts have been made to achieve a certain degree of generalization with the tool XESame [6], which assists in the task of defining mappings between database fields on the one side, and events, traces, and logs on the other. However, these solutions, which we refer to as part of the classical or traditional approach, are tightly coupled to the specific IT system or data schema they were designed to analyze. Moreover, they do not support the extraction of event data from systems that are non-process-aware and do not explicitly record historical information. For this reason, other techniques exist that try to leverage the existence of alternative sources of data. A very promising approach is redo log process mining [7]. Most modern relational database management systems (RDBMSs) implement different mechanisms to ensure consistency and fault tolerance. One of these mechanisms is redo log recording, which

consists of a set of files in which database operations are recorded before being applied to the actual data. This makes it possible to roll back the state of the database to previous points in time, undoing the last operations recorded in the redo log files. Redo log process mining exploits the information stored in database redo log files in order to obtain event data. This event data can be analyzed to understand the behavior of processes interacting with the database. One of the benefits of this approach is its independence from the specific application or process in execution, being able to extract behavioral information from both process-aware and non-process-aware systems. Also, the event extraction is carried out automatically, without the need for domain knowledge about how to build events from database tables, as is the case in the traditional approach. However, the prerequisites of this approach are that (a) the redo log system needs to be explicitly configured and enabled in order to record the events, and that (b) special database privileges are required to be able to read the content of the redo log files from the RDBMS. With respect to the traditional and redo log process mining approaches, we face two main questions. (1) Is redo log process mining feasible in a real-life environment? (2) Are the results of both approaches comparable in terms of data quality? Based on our intuition and experience with sample datasets, we propose the following hypothesis: the data obtained by the redo log process mining approach is at least as rich as the data obtained by traditional methods. The goal of this paper is to answer these questions and find support for this hypothesis by comparing the results of both process mining approaches on a real-life dataset. The content of this paper is based on the work developed in [8] as part of one of the authors' Master projects. The remainder of this paper is organized as follows.
First, Section 2 provides some background on the event data extraction techniques about to be compared. After that, a theoretical comparison is presented in Section 3. Then, Section 4 proposes the practical comparison, introducing the business case, explaining the execution of the data extraction, and showing the results. Section 5 compares the results of the application of both approaches, discussing their validity and equivalence. Finally, Section 6 presents the conclusion of this paper.

2 Background

We want to compare two approaches for event data extraction: traditional and redo log process mining. These two approaches differ with respect to the source of data, as well as the procedure they follow to extract it. This section provides some background on the particularities of both approaches, explaining the process to follow for their application, while focusing on the data extraction and processing stages.

2.1 Traditional Process Mining

In traditional process mining, event logs are constructed from the plain files or the database tables of the IT system under study. The main event attributes (activity name, case id, timestamp, etc.) are identified by hand, making use of domain knowledge, and extracted in order to build an event log. This is a rather laborious task, as described in the procedure in [9], but very common during the first stages of a process mining project. In some scenarios, data is obtained directly from the original IT systems that drive the process being analyzed. On other occasions, data has already been preprocessed and gathered in data warehouses or similar systems, somewhat alleviating the data extraction issue. In these cases, the work of extracting and processing the data cannot be avoided altogether; it must be performed in a previous phase. The complexity of the task is tackled before the analysis

is done, but the decisions made during the data warehouse design can dramatically affect the kind of analysis that can be performed on the resulting data. In order to apply process mining techniques, it is necessary to have access to event data that includes, at least, timestamps, activity names, and case identifiers. However, not all the data models of data warehouses guarantee that these aspects of the data are being preserved. In order to ensure that enough information is being collected, process-aware meta models like the one proposed in [10] can be adopted. Regardless of the location of the data, it is necessary to obtain a valid event log in order to do the process mining analysis. Different methodologies exist in the literature that describe the steps to take in a process mining project. For our purpose, we decided to focus on the process mining methodology PM2 [11], a recent methodology that covers all the stages in the life cycle of a process mining project, and which has been verified in a real-life environment. PM2 divides the project into six stages: planning, extraction, data processing, mining & analysis, evaluation, and process improvement & support. Given that we are interested in obtaining an event log, we focus on the first three stages: planning, extraction, and data processing. Each of these stages has sub-steps, as described in Table 1.

Table 1: First three steps of the PM2 process mining methodology.

  Stage 1: Planning               Stage 2: Extraction              Stage 3: Data Processing
  Selecting business processes    Determining scope                Creating views
  Identifying research questions  Extracting event data            Aggregating events
  Composing project team          Transferring process knowledge   Enriching logs
                                                                   Filtering logs

In traditional process mining, these three stages are carried out manually by the analyst or the process mining team.
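In practice, the extraction stage of the traditional approach often boils down to a hand-written SQL query over the application's tables. The following is a minimal sketch using an in-memory SQLite database; the table and column names are hypothetical and stand in for whatever schema the system under study happens to use:

```python
import sqlite3

# Sketch of the traditional extraction: a hand-written query turns rows
# of a history table into events with a case id, activity, and timestamp.
# The table and column names here are hypothetical.
def extract_event_log(conn):
    rows = conn.execute(
        """SELECT ticket_id, state, change_time
           FROM ticket_history
           ORDER BY ticket_id, change_time"""
    ).fetchall()
    return [{"case": t, "activity": s, "timestamp": c} for t, s, c in rows]

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE ticket_history (ticket_id, state, change_time)")
    conn.executemany(
        "INSERT INTO ticket_history VALUES (?, ?, ?)",
        [(1, "StateUpdate", "2016-06-17 10:30"),
         (1, "NewTicket", "2016-06-17 09:00"),
         (2, "NewTicket", "2016-06-18 08:15")],
    )
    for event in extract_event_log(conn):
        print(event)
```

Note that the ORDER BY clause already encodes the case notion and event ordering; choosing both is exactly where the domain knowledge discussed below comes in.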
Usually, these stages require substantial domain knowledge to define the business questions, select the right database tables, determine the case notion, and include interesting event and case attributes, among other tasks. This domain knowledge is often obtained through interviews with the process owners and users. The data is usually retrieved from database tables by executing SQL queries to build the events and, finally, extract the event logs. However, the quality of the event data that can be obtained is constrained by the existence of historical information, timestamps, status changes, modifications, additional attributes, etc. As has been noted before, the structure of the data model strongly determines the usefulness of the resulting event logs. Other event data retrieval techniques, such as redo log process mining, try to mitigate these issues by exploiting the historical data automatically recorded by the database systems. Section 3 presents some of the challenges to face with the traditional process mining approach, and compares them to the ones faced by redo log process mining.

2.2 Redo Log Process Mining

Redo log process mining is a more automatic technique than the traditional approach. It requires less domain knowledge, and is independent of the system under study. It tries to exploit the execution information stored in database redo logs in order to extract event data. The database redo log system is a functionality of database management systems that, in order to ensure consistency and fault tolerance, records all the data modification actions executed on the database before they are actually applied. Generally, a set of files is configured to store the redo logs. The RDBMS stores the actions in the redo log files and, when a file is full, it moves on to the next file. When all the redo log files are full (according to a specific maximum size), the first file of the set is overwritten.
This means that, under a default setting, only a recent window of events can be retrieved from the redo logs. However,

database systems usually allow archiving the completed redo log files in a separate location for subsequent analysis. This is a crucial aspect to take into account in order to collect enough data to perform a meaningful analysis. In general, any modification action on the database is recorded in the redo logs. This means that we can not only observe insert, update, and delete operations performed on every piece of data, but also modifications of the data schema, transactions, rollback and commit operations, etc. The main advantage of this technique is that it makes it possible to analyze systems that are not process-aware and do not explicitly record any execution information. Also, deleted data, no longer present in the database, can be recovered from the redo logs. This has great value from the forensic audit point of view. On the other hand, the technique presents challenges in terms of data availability, permissions, and performance. The following section explores some of these difficulties and compares them to the ones faced by traditional process mining.

3 Theoretical Comparison

In the previous sections we have described the fundamentals of both the traditional and redo log process mining approaches. In this section we point out the main differences from a theoretical point of view, clarifying the challenges to face in order to apply either technique in a process mining project.
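As a concrete reference point for the redo log side of this comparison, Oracle exposes redo records through its LogMiner interface (the DBMS_LOGMNR package and the V$LOGMNR_CONTENTS view). The sketch below only composes the statements such an extraction session might run; the archived file name is hypothetical, and actually executing these statements requires the special privileges discussed later:

```python
# Sketch: the statements an Oracle session might run to read redo
# records through LogMiner. The file name is hypothetical; elevated
# privileges are needed, which is the "special privileges" requirement
# discussed in the text.
def logminer_statements(logfile):
    return [
        "BEGIN DBMS_LOGMNR.ADD_LOGFILE("
        f"LOGFILENAME => '{logfile}', OPTIONS => DBMS_LOGMNR.NEW); END;",
        "BEGIN DBMS_LOGMNR.START_LOGMNR("
        "OPTIONS => DBMS_LOGMNR.DICT_FROM_ONLINE_CATALOG); END;",
        # Every row returned here is a candidate event: an operation on
        # a table, with a timestamp and the redo SQL itself.
        "SELECT SCN, TIMESTAMP, OPERATION, TABLE_NAME, SQL_REDO "
        "FROM V$LOGMNR_CONTENTS",
        "BEGIN DBMS_LOGMNR.END_LOGMNR; END;",
    ]

if __name__ == "__main__":
    for stmt in logminer_statements("/u01/archive/redo_0001.arc"):
        print(stmt)
```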
Table 2: Requirements for the traditional and redo log process mining approaches.

  Aspect                               Traditional PM         Redo Log PM
  Data elements:
    Timestamps                         Required               Guaranteed
    Case notion                        Required               Required
    Activity names                     Required               Guaranteed
  Technical aspects:
    Event recording                    Application dependent  Automatic
    Completeness of data               Desirable              Desirable
    DB read access                     Required               Required
    Special privileges                 Not required           Required
    Snapshot of DB                     Desirable              Required

Table 2 shows the requirements of both approaches with respect to the availability of data elements, and some technical aspects to take into account. For each data element, the approaches present different levels of exposure. Something is required when it must be explicitly recorded and available in the database tables. If it is guaranteed, this means that it is assured to be available, regardless of the data schema or the application under study. With respect to the technical aspects, something is required when it must be available at extraction time. If it is automatic, this means that it is guaranteed to be available. Desirable means that it will positively affect the data quality, but is not critical for the technique to work. An aspect is application-dependent when its availability depends on the application under study; therefore, some uncertainty exists. Finally, an aspect is not required if it is not necessary for the technique to work and, in fact, will not affect the quality of the data. We discuss these elements in more detail below.

3.1 Data Elements

Looking at the top part of Table 2, we can identify the data elements that are needed to extract event logs at all, and we can see how differently these approaches obtain them. The

presence of timestamps, a case notion, and activity names is required by the traditional approach. This means that these elements must be recorded by the application and be available in the database tables at the moment of extraction. This represents the first and most important challenge to face in a process mining project. Without these three elements, we cannot construct events and, therefore, no event log. If these elements are not explicitly available, we cannot apply the traditional approach, and must find different ways to obtain events. Redo log process mining has a partial solution for this situation. Thanks to the automatic recording of redo logs by the RDBMS, we can automatically obtain database events, which contain timestamps, activity names, and, implicitly within the data, one or several case notions.

3.2 Technical Aspects

With respect to the technical aspects, the first challenge to face is the actual event recording. As mentioned before, in traditional process mining we depend on the application to actively record the events and store them in the database tables. Without this, we cannot build the event logs. However, redo log process mining relies on the automatic recording of events in the database redo logs. The fact that this is an automatic system means that event recording in redo logs is application-independent. Yet, it needs to be enabled. Many RDBMSs have this functionality, but it is often not properly configured, or not even enabled, by default. Therefore, despite being automatic, it is useless if it is not activated. The events in redo log process mining will be available as long as the recording is enabled, properly configured, and the redo logs are archived instead of being overwritten in a rotary manner. For different reasons, the completeness of the data available to be extracted cannot be guaranteed in either of the approaches.
Missing events would lead to incomplete traces that could affect the quality of the resulting analysis. With respect to the traditional approach, incomplete data can be caused by clean-up activities performed in the database, for example, removing batches of historical information to save space. Also, recording failures could cause completeness issues in the data. Based on our experience, this problem appears even more often when dealing with redo logs. As pointed out previously, redo log recording needs to be enabled and properly configured in order to work well for our purpose. The redo logs will only start to be recorded from the moment they are enabled. Any event that happened before that moment will be unknown to us. Also, the redo log archiving must be configured so that the redo logs do not get overwritten or discarded. If that is not the case, gaps in the data can appear, resulting in incomplete or missing traces that affect the quality of the resulting event log. Normally, when extracting event data, read access to the database is required in order to execute queries and read the content of tables. This requirement is independent of the technique used for the data extraction. Nonetheless, because of how critical the original files are, the redo log approach needs special privileges in order to load and read the content of redo log files. These privileges are not easy to obtain when dealing with production systems in a real-life environment. In our experience, it is safer to perform the data extraction on a cloned instance of the database system. This can be desirable for traditional process mining as well, but there it is not critical, since the extraction method is less computationally intensive and intrusive than redo log process mining. Additionally, the extraction of events from redo logs has another relative drawback: it requires a snapshot of the database.
This is due to the fact that the events recorded correspond to insertions, modifications or deletions of rows, and only the affected fields are reflected in the events. Therefore, unless we possess the complete set of redo log files since the system

creation (which is extremely rare), it is not possible to reconstruct the content of the additional fields exclusively from the redo logs. To solve this issue, a snapshot of the database content is required, such that the values of the missing fields can be queried. To summarize, the main challenges to face when extracting event data from a database system are determined by (a) the presence of the event data in the database, (b) the correct configuration of the event recording systems, and (c) the access and connectivity to the data systems with sufficient privileges to obtain the necessary information. Until now, we have explained the particularities of two data extraction approaches, together with the challenges they face, at a theoretical level and in a very general way. The next section presents a practical comparison performed with data from a real-life system, using both data extraction approaches, to see how these issues work out in real life.

4 Practical Comparison

In the previous sections, the advantages and challenges of extracting events from redo logs, as opposed to database tables, have been presented. However, these claims have no value without a proper validation. The aim of performing a case study with both the traditional and redo log process mining approaches in this section is twofold. First, to show that applying redo log process mining in a real-life scenario is possible. Second, to demonstrate that, in situations that satisfy certain minimum requirements, the results of redo log process mining are of at least the same quality as the ones obtained with the traditional approach.

4.1 Business Case

In order to carry out this case study in a fair manner, it was important to select a system that fulfilled the minimum requirements of both process mining techniques. That is, a system that explicitly records events in the database tables, and that allows redo log recording to be enabled at the RDBMS level.
The software system selected for this study is the OTRS 1 ticketing system. OTRS is a web-based, open-source, process-aware information system (PAIS), commercialized by the OTRS Group and used for customer service, help desk, and IT service management. It offers ticket creation and management, automation, time management, and reporting, among other functionalities. The specific instance of OTRS to be analyzed is a production installation within a well-known ICT company based in the Netherlands. The company has been using this instance of OTRS for at least two years now, since the end of 2014, with the purpose of managing the incidents of the IT systems of their clients. In fact, only a subset of the whole plethora of functionalities that OTRS offers is being actively used within the company. In the daily use of the OTRS system, customers send messages reporting issues. This triggers the creation of tickets in the system, which are followed up by IT specialists. After some interaction between customers and specialists, trying to determine the root cause of the issue, the ticket status evolves until the issue, hopefully, gets solved. The goal of the system is to help the company with their customer support in order to maintain a high level of service availability and quality. There are several reasons to choose this specific instance of the OTRS ticketing system. First, the fact that it is a PAIS makes it very attractive for applying the traditional process mining approach. In addition to that, it runs on an Oracle RDBMS, with the possibility to enable redo

1 OTRS:

log recording, which is a basic requirement to apply redo log process mining. Also, the system was being used in production, with real-life customers. And finally, the company owning the instance was interested in applying process mining to assess the quality of their service. This means that they were willing to cooperate and provide access to the required data and domain knowledge to carry out this case study. The next section describes the execution of the study and how both process mining approaches were applied to the OTRS data.

4.2 Execution

To obtain an event log from the system under study, it is necessary to follow a specific set of steps, depending on the approach used to extract the event data. However, in both cases, first we must define the scope of the analysis. The company is interested in answering business questions related to the incident solving process. In particular, these questions relate to the service-level agreements (SLAs) they have with their customers. When looking at the data model 2 of the OTRS system, we observe that the table TICKET plays a central role in the general schema. This table contains the main attributes of a ticket in OTRS. Also, the table TICKET_HISTORY holds the historical information related to each ticket. This means that the changes in the tickets are stored in the form of events in that table.

Table 3: Steps in the execution of the traditional and redo log process mining approaches to obtain an event log.

  Traditional PM                               Redo Log PM
  1. Query the database (SQL Developer)        1. Connection to DB (PADAS)
  2. View of events and cases (SQL Developer)  2. Extraction of Data Model (PADAS)
  3. Export log to disk (SQL Developer)        3. Extract events for each table (PADAS)
  4. Add trace attributes to log (RapidProM)   4. Build log (PADAS)
  5. Load log for analysis (ProM)              5. Export log to XES format (PADAS)
                                               6. Load log for analysis (ProM)
Additionally, messages and extra data linked to each ticket are stored in the table ARTICLE. In conclusion, we consider the table TICKET as the case table, and TICKET_HISTORY and ARTICLE as event tables. With the scope defined, it is possible to proceed with the data extraction to build an event log. Starting with traditional process mining, we executed the steps in the left column of Table 3. The details regarding the execution of these steps are outside the scope of this paper; extensive information about the full study can be found in [8]. The result is an event log whose characteristics can be observed in Table 4, under the column Traditional PM. The data extraction process for the redo log process mining approach differs from the traditional one mainly in the source of data, which is the redo log files instead of the database tables. This means that special tools need to be used, in this case the Process Aware Data Suite 3 (PADAS). This tool allows connecting to an Oracle database, and is able to extract the data model and the events contained in the redo log files for any table of the schema. Also, once the events have been extracted, the tool supports the log creation step, grouping events into traces according to the desired case notion. More details on the log creation are available in [7]. The steps followed in the data extraction and log building phase for the redo log approach are listed in the right column of Table 3. The log exported from the PADAS tool presents the characteristics observable in Table 4, under the column Redo Log PM.

2 database.png
3 PADAS: egonzale/projects/padas/
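The log-building step just described, grouping database events into traces according to a case notion, can be sketched as follows. The dictionaries stand in for extracted database events; the field names are illustrative, not PADAS's internal format:

```python
from collections import defaultdict

# Sketch of the log-building step: events are grouped into traces by the
# chosen case notion (here the ticket id) and ordered by timestamp
# within each trace. Field names are illustrative.
def build_log(events, case_key="ticket_id"):
    traces = defaultdict(list)
    for event in events:
        traces[event[case_key]].append(event)
    for trace in traces.values():
        trace.sort(key=lambda ev: ev["timestamp"])
    return dict(traces)

if __name__ == "__main__":
    events = [
        {"ticket_id": 1, "activity": "StateUpdate", "timestamp": 2},
        {"ticket_id": 2, "activity": "NewTicket", "timestamp": 1},
        {"ticket_id": 1, "activity": "NewTicket", "timestamp": 1},
    ]
    log = build_log(events)
    print([ev["activity"] for ev in log[1]])  # the trace of ticket 1
```

Choosing a different case notion (for example, the customer instead of the ticket) only changes the grouping key, which is what makes this step configurable rather than schema-specific.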

4.3 Results

To discuss the results, we take a look at several aspects of the event logs obtained by traditional and redo log process mining, to evaluate their main differences. Analyzing Table 4, it is clear that there is a big difference in the covered period of time, as well as in the size of the event logs obtained by the two data extraction approaches. The redo log data is not as extensive as the data obtained by the traditional method. This is due to the fact that the redo log recording on the Oracle database hosting the OTRS data schema was enabled at the beginning of the project, around March 2016, and continued until July of the same year. However, the traditional approach was able to extract all the events in the TICKET_HISTORY table, which was never deleted or purged since the OTRS system was set up at the end of 2014. That is the main reason for the big difference in data quality between both approaches.

Table 4: Metrics of the resulting logs for both approaches on all the available data.

  Metric                            Traditional PM  Redo Log PM
  Time window captured (days)
  Magnitude (# of cases)
  Support (# of events)
  Number of distinct event classes
  Granularity of timestamps         seconds         seconds

Fig. 1: Missing archived logs over time in 2016. Shaded areas indicate the availability of archived logs, and white areas indicate the gaps.

Table 5: Metrics of the resulting logs for the period from June 17th to July 12th.

  Metric                            Traditional PM  Redo Log PM
  Time window captured (days)
  Magnitude (# of cases)
  Support (# of events)             6342
  Number of distinct event classes  22              22
  Granularity of timestamps         seconds         seconds

Additionally, after observing the resulting event log from the redo log process mining approach, one more data quality issue was identified. Big time gaps were spotted in the extracted data, as shown in Figure 1. However, this problem did not exist in the data obtained by the traditional approach, which was complete.
Further investigation of the root cause showed that the reason for this was a misconfiguration of the cloned server used in the study. On this server, a daily script would archive the already-filled redo log files to a storage location. However, in some cases, a race condition occurred with another script in charge of cleaning up storage for space-saving purposes. This caused the loss of redo log files for full days and, consequently, incomplete cases and data quality issues. The issue was fixed as soon as it was detected and, fortunately, data continued to be recorded, this time without interruption. In order to ensure a fair comparison of the process mining approaches, the following strategy was adopted: from the timeline of redo log data observable in Figure 1, the largest uninterrupted period was selected to be compared between both logs. This period runs from June 17th to July 12th. The resulting event logs were then compared, and the metrics are presented in Table 5. The following section provides a discussion of the equivalence of these two event logs, looking at them from the structural and behavioral points of view.
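The window-selection strategy just described can be sketched as follows: given the days for which archived redo logs survived, find the longest gap-free run. The dates below are illustrative, not the study's actual archive inventory:

```python
from datetime import date

# Sketch: find the largest uninterrupted period in a set of days for
# which archived redo logs are available, so that both approaches can be
# compared over the same gap-free window. Dates are illustrative.
def longest_uninterrupted_period(days):
    days = sorted(set(days))
    best_start = cur_start = days[0]
    best_end = prev = days[0]
    for day in days[1:]:
        if (day - prev).days > 1:   # a gap: start a new run
            cur_start = day
        prev = day
        if day - cur_start > best_end - best_start:
            best_start, best_end = cur_start, day
    return best_start, best_end

if __name__ == "__main__":
    available = (
        [date(2016, 6, 3), date(2016, 6, 4)]          # short early run
        + [date(2016, 6, d) for d in range(17, 31)]   # June 17-30
        + [date(2016, 7, d) for d in range(1, 13)]    # July 1-12
    )
    print(longest_uninterrupted_period(available))
```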

5 Discussion

It has been previously stated that the goal of this work is to find support for the hypothesis that the data obtained by the redo log process mining approach is at least as rich as the data obtained by traditional methods. Section 3 shows the intuition behind this hypothesis from the theoretical point of view. Then, Section 4 takes a practical perspective on the evaluation, applying both process mining approaches in a real-life environment. The aim of this section is to analyze the results of the practical comparison, in order to support the aforementioned hypothesis, and to explain the possible differences between the event logs obtained by both process mining approaches.

5.1 Event Labels Comparison

Table 5 shows that, when focusing on a period of time during which data is available for both approaches, the event logs coincide in the number of cases. Also, the number of events extracted by the redo log approach is higher than the number obtained by the traditional one. However, this does not guarantee that the former is a superset of the latter. To find evidence for this, we have to look at the event labels in both logs. Table 6 shows a list of event labels ordered by frequency for both event logs. At first sight, the event labels seem disjoint. However, further analysis shows that the two most frequent event labels in the redo log process mining event log, namely NewEventNoMsg and NewEventWithMsg, correspond to the redo log events obtained from the TICKET_HISTORY table. This table is the source of events for the traditional process mining approach. In fact, the sum of the frequencies of these two event labels, 5032 and 1310 respectively, is equal to 6342 events, the total number of events in the event log obtained with the traditional approach.
Table 6: Event labels and frequencies with the default classifiers for the two event logs.

  Traditional PM                      Redo Log PM
  Activity label      Freq  Rel Freq  Activity label                  Freq  Rel Freq
  Misc                                NewEventNoMsg                   5032
  OwnerUpdate                         NewEventWithMsg                 1310
  StateUpdate                         MessagePhoneOrNote
  CustomerUpdate                      MessageTicketMerged
  NewTicket                           AutoReplyTicketReceived
  SendAgentNotif                      NewArticleA
  AddNote                             NewArticleB
  Lock                                NewArticleC
  Unlock                              UpdateMsg-TicketId-Time-User
  Merged                              UpdateEvent-TicketId-Time-User
  TicketLinkAdd                       New Note-Customer Agent
  FollowUp                            NewArticleD
  Customer                            UpdateEvent-TicketId-Time
  SendAutoReply                       UpdateMessage-TicketId-Time
  SendAnswer                          UpdateMessage-TicketId-User
  Move                                UpdateEvent-TicketId-User
  PriorityUpdate                      NewMessage-CustomerOrAgent
  TypeUpdate                          New External
  SendCustomerNotif                   UpdateMessage-TicketId
  TimeAccounting                      UpdateEvent-TicketId
  SetPendingTime                      FromCustomerWithoutCC
  SendAutoFollowUp                    FromCustomerWithCC
  Total               6342            Total

The reason why the 22 event types of one log are grouped into only two in the other is that, in the latter, the event classifier is automatically provided by the approach. This classifier takes into account the table in which the event occurred,

and which fields were affected. However, in the traditional approach, the event classifier takes into account the value of the ticket_state_id field, which maps integer values to the event labels on the left side of Table 6. Therefore, using this event classifier on the events NewEventNoMsg and NewEventWithMsg of the redo log process mining event log would result in the same set of event labels, with the same frequencies. To be precise, the events from the redo log approach with the label NewEventNoMsg correspond to a subset of the events obtained through the traditional method with the following event labels: Misc, OwnerUpdate, StateUpdate, CustomerUpdate, NewTicket, SendAgentNotification, Lock, Unlock, Merged, TicketLinkAdd, Move, PriorityUpdate, TypeUpdate, and SetPendingTime. With respect to the events with the label NewEventWithMsg, they correspond to a subset of the events with the labels: OwnerUpdate, StateUpdate, AddNote, FollowUp, Customer, SendAutoReply, SendAnswer, SendCustomerNotification, TimeAccounting, and SendAutoFollowUp. Therefore, we see that there is not a 1:n mapping between the event classes obtained by both approaches. On the contrary, it is an n:m relation, with cases like the activity OwnerUpdate from the log of the traditional approach, which groups events that can correspond either to the activity NewEventWithMsg or to the activity NewEventNoMsg of the log of the redo log approach. It is important to note that the fact that in Table 5 the number of distinct event classes is the same for both logs (22) is just a coincidence. Actually, the real number of event classes in the redo log process mining event log using an appropriate event classifier should be 42, since two of the 22 event classes of this log correspond to the 22 obtained with the traditional method.

5.2 Control Flow Comparison

The equivalence of both event logs has been analyzed from the event labels point of view.
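The reclassification argument from Section 5.1 can be sketched as follows. The integer-to-label mapping below is hypothetical (the real OTRS code table is not reproduced here); what matters is that the traditional classifier can be re-applied to the coarse redo log events:

```python
# Sketch: re-apply the traditional classifier (an id-to-label mapping)
# to the coarse-grained redo log events. Both the mapping and the
# event fields are hypothetical, for illustration only.
STATE_ID_TO_LABEL = {1: "NewTicket", 2: "StateUpdate", 3: "OwnerUpdate"}

COARSE_LABELS = {"NewEventNoMsg", "NewEventWithMsg"}

def reclassify(redo_events):
    """Replace coarse redo log labels by fine-grained traditional ones."""
    out = []
    for event in redo_events:
        if event["activity"] in COARSE_LABELS:
            event = {**event,
                     "activity": STATE_ID_TO_LABEL[event["ticket_state_id"]]}
        out.append(event)
    return out

if __name__ == "__main__":
    redo_events = [
        {"activity": "NewEventNoMsg", "ticket_state_id": 1},
        {"activity": "NewEventWithMsg", "ticket_state_id": 2},
        {"activity": "NewArticleA", "ticket_state_id": None},
    ]
    print([ev["activity"] for ev in reclassify(redo_events)])
```

Events that do not originate from the history table (here NewArticleA) keep their redo log labels, which is why the fine-grained log would end up with more event classes than either original log.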
However, without mining the traces, we cannot guarantee that the two event logs represent equivalent behavior. To check this aspect, we mined the event logs using the same event classifier in both cases. As discussed previously, the event log obtained from the redo logs contains a superset of the events in the one extracted by the traditional approach. In order to compare the behavior of both logs, we must focus on the same subset of activities. Therefore, the event log obtained from the redo log was filtered to only include events corresponding to the labels NewEventNoMsg and NewEventWithMsg. Then, the same classifier as in the traditional approach was used, so that both event logs would have the same set of event classes. After this preparatory step, we mined both logs using the Inductive Miner Infrequent. The resulting process models can be observed in Figure 2. From observing both models, we see that they mostly represent the same control flow. However, some differences can be spotted immediately. First, the activities Customer and SendAutoReply occur in parallel in Figure 2a, while they are in sequence in Figure 2b. Second, the activity SendAnswer is part of a choice in Figure 2a, while it happens before the choice in Figure 2b. Third, the activities NewTicket and CustomerUpdate always happen in the 6th and 5th positions from the end of the trace in Figure 2b, while in Figure 2a they can only be executed in mutual exclusion with the bottom part of the process. These differences, though graphically subtle, can imply a considerable difference in behavior. Fortunately, there is an explanation for them. There are two main reasons for this disagreement in control flow between the two event logs. (1) The event timestamps obtained by the two approaches are set by different mechanisms. In the traditional approach, the timestamps of each event correspond to the ones written by the OTRS system in the timestamp field of the TICKET HISTORY table.
In the redo log approach, the timestamps correspond to the ones recorded by the Oracle RDBMS when processing the SQL statements sent by the OTRS system. Therefore, a difference in the order of events between the traditional and the redo log approach can occur, given that the timestamps in the former correspond to the behavior enforced by OTRS, while the timestamps in the latter correspond to the actual execution of the associated statements in the database. (2) The events obtained by the traditional approach correspond to rows in the table TICKET HISTORY of the database, and their content can be modified during the life-cycle of the process. The events recorded by the redo log system, however, are immutable: a modification of a row in TICKET HISTORY creates a new event in the redo log files. In fact, the OTRS system is known to modify the fields TicketID, User, and Time of the TICKET HISTORY rows whenever two tickets are merged together. The presence of the activities UpdateEvent-TicketId-Time, UpdateEvent-TicketId-User, UpdateEvent-TicketId-Time-User, and UpdateEvent-TicketId in the event log obtained from the redo logs is evidence of this behavior. Therefore, after this comparison at both the activity label and the control flow level, we can conclude that the behavior captured by the event log produced by the traditional approach is indeed a subset of the behavior captured by the redo log approach, and that the latter can easily be filtered in order to achieve a high degree of equivalence.

Fig. 2: Process models mined with the Inductive Miner Infrequent (noise threshold = 0.2). (a) Petri net mined for the event log obtained through traditional process mining. (b) Petri net mined for the event log obtained through redo log process mining.
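The effect described in reason (2) can be illustrated with a minimal simulation. The row structure, field names, and merge behavior below are simplified assumptions, not the actual OTRS implementation: the point is only that an in-place update rewrites what the traditional extraction will later read, while the redo records remain append-only.

```python
# HYPOTHETICAL sketch: why mutable TICKET_HISTORY rows and immutable
# redo records diverge after a ticket merge.

ticket_history = []   # the table that the traditional approach reads
redo_records = []     # append-only change records, as in the redo logs

def insert_row(row):
    ticket_history.append(dict(row))
    redo_records.append({"op": "INSERT", "data": dict(row)})

def update_row(index, changes):
    ticket_history[index].update(changes)  # overwrites history in place
    redo_records.append({"op": "UPDATE", "data": dict(changes)})

insert_row({"HistoryID": 1, "TicketID": 100, "Time": "10:00"})
# Merging two tickets rewrites TicketID and Time of the existing row:
update_row(0, {"TicketID": 200, "Time": "10:05"})

# Traditional extraction sees only the rewritten row ...
assert ticket_history == [{"HistoryID": 1, "TicketID": 200, "Time": "10:05"}]
# ... while the redo records still contain the original insert,
# plus a separate UPDATE event for the merge:
assert redo_records[0]["data"]["TicketID"] == 100
assert len(redo_records) == 2
```

In this toy setting, the extra UPDATE record plays the role of the UpdateEvent-TicketId-Time activities observed in the redo log event log.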

6 Conclusion

In this paper, two process mining approaches have been compared with respect to the data extraction phase: traditional process mining and redo log process mining. The evaluation was performed in a unique setting: both approaches were applied in a real-life environment, on real data from real systems, in order to determine the level of equivalence between the results obtained through the two methods. Analyzing the results, we concluded that, once the difficulties of applying the redo log approach are overcome, this method is able to retrieve richer event logs, of higher quality in terms of the number of events and the reliability of the captured behavior. Additionally, it has been shown that traditional approaches are vulnerable to event manipulation, which can alter the results of the analysis, while the redo log approach ensures the immutability of the events and is therefore more robust to data manipulation and fraud. In addition to these benefits, redo log process mining, unlike the traditional approach, can be applied to systems that are not process-aware, in which events are not explicitly recorded at the application level but which still use an RDBMS for data storage. However, this comes at a price: the need for special privileges to configure and enable redo log recording makes it difficult to set up in some environments, while the traditional approach only requires read access to the relevant database tables. All things considered, redo log process mining can be considered a viable alternative to traditional process mining. As future work, new sources of event data will be explored, in order to tackle the limitations of the redo log approach and to improve the quality of the extracted event logs with respect to traditional methods.

References

1. Watson, H.J., Wixom, B.H.: The current state of business intelligence. Computer 40(9) (2007)
2. Ingvaldsen, J.E., Gulla, J.A.: Preprocessing support for large scale process mining of SAP transactions. In: Business Process Management Workshops, Springer (2008)
3. Roest, A.: A practitioner's guide for process mining on ERP systems: the case of SAP order to cash. Master's thesis, Technische Universiteit Eindhoven, The Netherlands (2012)
4. Segers, I.: Investigating the application of process mining for auditing purposes. Master's thesis, Technische Universiteit Eindhoven, The Netherlands (2007)
5. Yano, K., Nomura, Y., Kanai, T.: A practical approach to automated business process discovery. In: Enterprise Distributed Object Computing Conference Workshops (EDOCW), IEEE (Sept 2013)
6. Verbeek, H.M.W., Buijs, J.C.A.M., van Dongen, B.F., van der Aalst, W.M.P.: XES, XESame, and ProM 6. In: Information Systems Evolution. Springer (2011)
7. González-López de Murillas, E., van der Aalst, W.M.P., Reijers, H.A.: Process mining on databases: Unearthing historical data from redo logs. In: Business Process Management. Springer (2015)
8. Hoogendoorn, G.E.: A comparative study for process mining approaches in a real-life environment. Master's thesis, Eindhoven University of Technology (2017)
9. Jans, M.J.: From relational database to valuable event logs for process mining purposes: a procedure. Technical report, Hasselt University (2017)
10. González López de Murillas, E., Reijers, H.A., van der Aalst, W.M.P.: Connecting databases with process mining: A meta model and toolset. In: International Workshop on Business Process Modeling, Development and Support, Springer (2016)
11. van Eck, M.L., Lu, X., Leemans, S.J., van der Aalst, W.M.P.: PM2: A process mining project methodology. In: International Conference on Advanced Information Systems Engineering, Springer (2015)


More information

How to Manage your Process Mining Analysis - Best Practices and Challenges. Willy van de Schoot Process Mining Camp June 15 th, 2015

How to Manage your Process Mining Analysis - Best Practices and Challenges. Willy van de Schoot Process Mining Camp June 15 th, 2015 How to Manage your Process Mining Analysis - Best Practices and Challenges Willy van de Schoot Process Mining Camp June 15 th, 2015 Atos Managed Services Atos Managed Services Manage customer ICT infrastructure

More information

Best Practices. Deploying Optim Performance Manager in large scale environments. IBM Optim Performance Manager Extended Edition V4.1.0.

Best Practices. Deploying Optim Performance Manager in large scale environments. IBM Optim Performance Manager Extended Edition V4.1.0. IBM Optim Performance Manager Extended Edition V4.1.0.1 Best Practices Deploying Optim Performance Manager in large scale environments Ute Baumbach (bmb@de.ibm.com) Optim Performance Manager Development

More information

Certified Information Systems Auditor (CISA)

Certified Information Systems Auditor (CISA) Certified Information Systems Auditor (CISA) 1. Domain 1 The Process of Auditing Information Systems Provide audit services in accordance with IT audit standards to assist the organization in protecting

More information

Online Conformance Checking for Petri Nets and Event Streams

Online Conformance Checking for Petri Nets and Event Streams Online Conformance Checking for Petri Nets and Event Streams Andrea Burattin University of Innsbruck, Austria; Technical University of Denmark, Denmark andbur@dtu.dk Abstract. Within process mining, we

More information

Accurate study guides, High passing rate! Testhorse provides update free of charge in one year!

Accurate study guides, High passing rate! Testhorse provides update free of charge in one year! Accurate study guides, High passing rate! Testhorse provides update free of charge in one year! http://www.testhorse.com Exam : 70-467 Title : Designing Business Intelligence Solutions with Microsoft SQL

More information

A ProM Operational Support Provider for Predictive Monitoring of Business Processes

A ProM Operational Support Provider for Predictive Monitoring of Business Processes A ProM Operational Support Provider for Predictive Monitoring of Business Processes Marco Federici 1,2, Williams Rizzi 1,2, Chiara Di Francescomarino 1, Marlon Dumas 3, Chiara Ghidini 1, Fabrizio Maria

More information

Online Conformance Checking for Petri Nets and Event Streams

Online Conformance Checking for Petri Nets and Event Streams Downloaded from orbit.dtu.dk on: Apr 30, 2018 Online Conformance Checking for Petri Nets and Event Streams Burattin, Andrea Published in: Online Proceedings of the BPM Demo Track 2017 Publication date:

More information

20762B: DEVELOPING SQL DATABASES

20762B: DEVELOPING SQL DATABASES ABOUT THIS COURSE This five day instructor-led course provides students with the knowledge and skills to develop a Microsoft SQL Server 2016 database. The course focuses on teaching individuals how to

More information

Process Model Consistency Measurement

Process Model Consistency Measurement IOSR Journal of Computer Engineering (IOSRJCE) ISSN: 2278-0661, ISBN: 2278-8727Volume 7, Issue 6 (Nov. - Dec. 2012), PP 40-44 Process Model Consistency Measurement Sukanth Sistla CSE Department, JNTUniversity,

More information

CAPACITY PLANNING FOR THE DATA WAREHOUSE BY W. H. Inmon

CAPACITY PLANNING FOR THE DATA WAREHOUSE BY W. H. Inmon CAPACITY PLANNING FOR THE DATA WAREHOUSE BY W. H. Inmon The data warehouse environment - like all other computer environments - requires hardware resources. Given the volume of data and the type of processing

More information

Data Warehousing. Data Warehousing and Mining. Lecture 8. by Hossen Asiful Mustafa

Data Warehousing. Data Warehousing and Mining. Lecture 8. by Hossen Asiful Mustafa Data Warehousing Data Warehousing and Mining Lecture 8 by Hossen Asiful Mustafa Databases Databases are developed on the IDEA that DATA is one of the critical materials of the Information Age Information,

More information

DATABASE SCALABILITY AND CLUSTERING

DATABASE SCALABILITY AND CLUSTERING WHITE PAPER DATABASE SCALABILITY AND CLUSTERING As application architectures become increasingly dependent on distributed communication and processing, it is extremely important to understand where the

More information

Oracle Database: SQL and PL/SQL Fundamentals

Oracle Database: SQL and PL/SQL Fundamentals Oracle University Contact Us: 001-855-844-3881 & 001-800-514-06-9 7 Oracle Database: SQL and PL/SQL Fundamentals Duration: 5 Days What you will learn This Oracle Database: SQL and PL/SQL Fundamentals training

More information

Challenges on Combining Open Web and Dataset Evaluation Results: The Case of the Contextual Suggestion Track

Challenges on Combining Open Web and Dataset Evaluation Results: The Case of the Contextual Suggestion Track Challenges on Combining Open Web and Dataset Evaluation Results: The Case of the Contextual Suggestion Track Alejandro Bellogín 1,2, Thaer Samar 1, Arjen P. de Vries 1, and Alan Said 1 1 Centrum Wiskunde

More information

Oracle 1Z0-053 Exam Questions & Answers

Oracle 1Z0-053 Exam Questions & Answers Oracle 1Z0-053 Exam Questions & Answers Number: 1Z0-053 Passing Score: 660 Time Limit: 120 min File Version: 38.8 http://www.gratisexam.com/ Oracle 1Z0-053 Exam Questions & Answers Exam Name: Oracle Database

More information

"Charting the Course... MOC C: Developing SQL Databases. Course Summary

Charting the Course... MOC C: Developing SQL Databases. Course Summary Course Summary Description This five-day instructor-led course provides students with the knowledge and skills to develop a Microsoft SQL database. The course focuses on teaching individuals how to use

More information

Features of the architecture of decision support systems

Features of the architecture of decision support systems Features of the architecture of decision support systems van Hee, K.M. Published: 01/01/1987 Document Version Publisher s PDF, also known as Version of Record (includes final page, issue and volume numbers)

More information

Db2 9.7 Create Table If Not Exists >>>CLICK HERE<<<

Db2 9.7 Create Table If Not Exists >>>CLICK HERE<<< Db2 9.7 Create Table If Not Exists The Explain tables capture access plans when the Explain facility is activated. You can create them using one of the following methods: for static SQL, The SYSTOOLS schema

More information

A Mechanism for Sequential Consistency in a Distributed Objects System

A Mechanism for Sequential Consistency in a Distributed Objects System A Mechanism for Sequential Consistency in a Distributed Objects System Cristian Ţăpuş, Aleksey Nogin, Jason Hickey, and Jerome White California Institute of Technology Computer Science Department MC 256-80,

More information

Hybrid Data Platform

Hybrid Data Platform UniConnect-Powered Data Aggregation Across Enterprise Data Warehouses and Big Data Storage Platforms A Percipient Technology White Paper Author: Ai Meun Lim Chief Product Officer Updated Aug 2017 2017,

More information

Microsoft. [MS20762]: Developing SQL Databases

Microsoft. [MS20762]: Developing SQL Databases [MS20762]: Developing SQL Databases Length : 5 Days Audience(s) : IT Professionals Level : 300 Technology : Microsoft SQL Server Delivery Method : Instructor-led (Classroom) Course Overview This five-day

More information