A Heuristic Approach for Discovering Reference Models by Mining Process Model Variants

Size: px

Start display at page:

Download "A Heuristic Approach for Discovering Reference Models by Mining Process Model Variants"

Patrick Cameron
5 years ago
Views:

1 A Heuristic Approch for Discovering Reference Models by Mining Process Model Vrints Chen Li 1, Mnfred Reichert 2, nd Andres Wombcher 3 1 Informtion System Group, University of Twente, The Netherlnds lic@cs.utwente.nl 2 Institute of Dtbses nd Informtion Systems, Ulm University, Germny mnfred.reichert@uni-ulm.de 3 Dtbse Group, University of Twente, The Netherlnds.wombcher@utwente.nl Abstrct. Recently, new genertion of dptive Process-Awre Informtion Systems (PAISs) hs emerged, which enbles structurl process chnges during runtime while preserving PAIS robustness nd consistency. Such flexibility, in turn, leds to lrge number of process vrints derived from the sme model, but differing in structure. Generlly, such vrints re expensive to configure nd mintin. This pper provides heuristic serch lgorithm which fosters lerning from pst process chnges by mining process vrints. The lgorithm discovers reference model bsed on which the need for future process configurtion nd dpttion cn be reduced. It dditionlly provides the flexibility to control the process evolution procedure, i.e., we cn control to wht degree the discovered reference model differs from the originl one. As benefit, we cn not only control the effort for updting the reference model, but lso gin the flexibility to perform only the most importnt dpttions of the current reference model. Our mining lgorithm is implemented nd evluted by simultion using more thn 7000 process models. Simultion results indicte strong performnce nd sclbility of our lgorithm even when fcing lrge-sized process models. 1 Introduction In tody s dynmic business world, success of n enterprise incresingly depends on its bility to rect to chnges in its environment in quick, flexible nd cost-effective wy [22]. However, current off-the-shelf enterprise softwre does not meet these needs [23]. It is deployed in different compnies, domins, nd countries, nd therefore tends to be too generic nd rigid. Generlly, introduction of enterprise softwre entils the problem of ligning business processes nd IT. This cuses huge customiztion efforts t the site of softwre buyers tht exceed the price for the softwre licenses by fctor five to ten [5]. Softwre vendors, in turn, mke endevors to close this lignment gp [34], nd mjor progress hs This work ws done in the MinAdept project, which hs been supported by the Netherlnds Orgniztion for Scientific Reserch under contrct number

2 2 been chieved by shifting from function- to process-centered softwre design. Along this trend vriety of process support prdigms s well s lnguges hve emerged. Using WS-BPEL [1], for exmple, n executble process cn be composed out of existing ppliction services. At runtime, execution of these services is then orchestrted by the PAIS ccording to the defined process logic. Recently, different pproches for dpting processes hve emerged. Generlly, structurl process dpttions re not only needed for configurtion purpose t build time [9, 32], but lso become necessry for single process instnces during runtime to del with exceptionl situtions nd chnging needs [25, 41]. In response to this need dptive process mngement technology hs emerged [41, 43]. It llows to configure nd dpt process models t different levels. This, in turn, results in lrge collections of process model vrints (process vrints for short) creted from the sme process model, but slightly differing from ech other in their structure. Fig. 1 depicts n exmple. The left hnd side shows high-level view on ptient tretment process s it is normlly executed: ptient is dmitted to hospitl, where he first registers, then receives tretment, nd finlly pys. In emergency situtions, however, it might become necessry to devite from this model, e.g., by first strting tretment of the ptient nd llowing him to register lter during tretment. To cpture this behvior in the model of the respective process instnce, we need to move ctivity receive tretment from its current position to the one prllel to ctivity register. This leds to instnce-specific process model vrint S s shown in Fig. 1b. Generlly, lrge number of process vrints derived from sme originl process model exist [22]. 1.1 Problem Sttement Though considerble efforts hve been mde to ese process configurtion nd customiztion [9, 25], most existing pproches hve not yet utilized the informtion resulting from pst process dpttions [40]. Fig. 2 describes the gol of this pper. We im t lerning from pst process chnges by merging process vrints into one generic process model, which covers these vrints best. By dopting this generic model s newreference process model within the PAIS, need for future process dpttions nd thus cost for chnge will decrese. =Move (S, register, dmitted, py) S[ >S register Admitted register receive tretment py dmitted py AND-Split receive tretment AND-Join e=<dmitted, receive tretment, register, py> ) S: originl process model b) S : finl execution & chnge Fig. 1. Originl Process Model S nd Process Vrint S

3 3 When deriving new process reference model, the originl one should be lso tken into ccount. In most cses, drmtic chnges of the current reference model re not preferred due to implementtion costs or socil resons. Process designers should therefore hve the flexibility to choose to wht degree they wnt to chnge the originl reference model to fit better to the vrints. Originl reference process model S customiztion & dpttion Process vrint S1 Process vrint S2 Process vrint Sn Control differences mining & lerning Discovered reference process model S Fig. 2. Discovering new reference model by lerning from pst process configurtions Bsed on the two ssumptions tht (1) process models re well-formed (i.e., block-structured like in WS-BPEL) nd (2) ll ctivities in process model hve unique lbels, this pper dels with the following fundmentl reserch question: Given reference model nd collection of process vrints configured from it, how to derive new reference process model by performing sequence of chnge opertions on the originl one, such tht the verge distnce between the reference model nd the process vrints becomes miniml? The distnce between the reference process model nd process vrint is mesured by the number of high-level chnge opertions (e.g., to insert, delete or move ctivities [25]) needed to trnsform the reference model into the respective vrint. Clerly, the shorter the distnce is, the less efforts needed for process dpttion re. Bsiclly, we discover new reference model by performing sequence of chnge opertions on the originl one. In this context, we provide users the flexibility to control the distnce between old reference model nd newly discovered one, i.e., to choose how mny chnge opertions shll be pplied to the old reference model. Clerly, the most relevnt chnges (which significntly contribute to reduce the verge distnce) should be considered first nd the less importnt ones lst. Prticulrly, if users decide to ignore the less relevnt chnges, the overll performnce of our lgorithm with respect to the described reserch gol will not be influenced too much. Such flexibility to control the difference between the originl nd the discovered model is significnt improvement when compred to our previous work [15]; the pproch presented in [15] enbles discovery of reference process model by mining collection of vrints, but is unble to tke the originl reference process model into ccount.

4 4 The reminder of this pper is orgnized s follows. Section 2 gives bckground informtion needed for understnding this pper. Section 3 introduces our heuristic serch lgorithm nd provides high-level overview on how it cn be used for mining process vrints. We describe two importnt spects of our heuristics lgorithm, (i.e., the fitness function nd the serch tree) in Sections 4 nd 5. To evlute the performnce of our mining lgorithm, we conduct simultion. Section 6 describes its setup, while Section 7 presents the simultion results. Finlly, Section 8 discusses relted work nd Section 9 concludes with summry nd outlook. 2 Bckgrounds We first introduce bsic notions needed in the following: Process Model: Let P denote the set of ll sound process models. A prticulr process model S = (N, E,...) 4 P is defined s well-structured Activity Net [25]. N constitutes the set of ctivities nd E the set of control edges (i.e., precedence reltions) linking them. To limit the scope, we ssume Activity Nets to be block-structured (similr to WS-BPEL). A simple exmple is depicted in Fig. 3. For detiled description nd correctness issues, we refer to [25]. Process chnge: A process chnge is ccomplished by pplying sequence of high-level chnge opertions to given process model S over time [25]. Such opertions modify the initil process model by ltering its set of ctivities nd their order reltions. Thus, ech ppliction of chnge opertion results in new process model. We define process chnge nd process vrint s follows: Definition 1 (Process Chnge nd Process Vrint). Let P denote the set of possible process models nd C be the set of possible process chnges. Let S, S P be two process models, let C be process chnge, nd let σ = 1, 2,... n C be sequence of chnges performed on initil model S. Then: S[ S iff is pplicble to S nd S is the (sound) process model resulting from the ppliction of to S. S[σ S iff S 1, S 2,... S n+1 P with S = S 1, S = S n+1, nd S i [ i S i+1 for i {1,... n}. We lso denote S s vrint of S. Exmples of high-level chnge opertions include insert ctivity, delete ctivity, nd move ctivity s implemented in the ADEPT chnge frmework [25]. While insert nd delete modify the set of ctivities in the process model, move chnges ctivity positions nd thus the structure of the process model. A forml semntics of these chnge ptterns is given in [31]. For exmple, opertion move(s, A,B,C) moves ctivity A from its current position within process model S to the position fter ctivity B nd before ctivity C. Opertion delete(s, A), in turn, deletes ctivity A from process model S. Issues concerning the correct 4 A Well-structured Activity Net contins more elements thn only node set N nd edge set E, which cn be ignored in the context of this pper.

5 5 use of these opertions, their generliztion nd forml pre-/post-conditions re described in [25]. Though the depicted chnge opertions re discussed in reltion to our ADEPT chnge frmework, they re generic in the sense tht they cn be esily pplied in connection with other process met models s well [31, 43]. For exmple, process chnge s relized in the ADEPT frmework cn be mpped to the concept of life-cycle inheritnce known from Petri Nets [37]. We refer to ADEPT since it covers by fr most high-level chnge ptterns nd chnge support fetures when compred to other dptive PAISs [41, 43]. Definition 2 (Bis nd Distnce). Let S, S P be two process models. Then: Distnce d (S,S ) between S nd S corresponds to the miniml number of high-level chnge opertions needed to trnsform S into S ; i.e., we define d (S,S ) := min{ σ σ C S[σ S }. Furthermore, sequence of chnge opertions σ with S[σ S nd σ = d (S,S ) is denoted s bis between S nd S. The distnce between S nd S is the miniml number of high-level chnge opertions needed for trnsforming S into S. The corresponding sequence of chnge opertions is denoted s bis B S,S between S nd S. 5 Usully, such distnce mesures the complexity for model trnsformtion (i.e., configurtion). As exmple tke Fig. 1. Here, distnce between model S nd vrint S 1 is one, since we only need to perform one chnge opertion move(s, register,dmitted,py) to trnsform S into S [17]. In generl, determining bis nd distnce between two process models hs complexity t N P level [17]. We consider high-level chnge opertions insted of chnge primitives (i.e., elementry chnges like dding or removing nodes / edges) to mesure the distnce between process models. This llows us to gurntee soundness of process models nd provides more meningful mesure for distnce [17, 41]. Definition 3 (Trce). Let S = (N, E,...) P be process model. We define t s trce of S iff: t < 1, 2,..., k > (with i N) constitutes vlid nd complete execution sequence of ctivities considering the control flow defined by S. We define T S s the set of ll trces tht cn be produced by process instnces running on process model S. t( b) is denoted s precedence reltionship between ctivities nd b in trce t < 1, 2,..., k > iff i < j : i = j = b. We only consider trces composing rel ctivities, but no events relted to silent ones, i.e., nodes within process model hving no ssocited ction nd only existing for control flow purpose [17]. At this stge, we consider two process models s being the sme if they re trce equivlent, i.e., S S iff T S T S. The stronger notion of bi-similrity [10] is not needed in our context. 5 Generlly, it is possible to hve more thn one miniml set of chnge opertions to trnsform S into S, i.e., given process models S nd S their bis does not need to be unique. A detiled discussion of this issue cn be found in [37, 17].

6 6 3 Overview of Our Heuristic Serch Algorithm Section 3.1 provides running exmple which we use throughout the pper. In Section 3.2, we introduce our heuristic serch lgorithm nd give high-level overview of how it cn be pplied for mining process vrints. 3.1 Running Exmple An illustrting exmple is given in Fig. 3. Out of n originl reference model S, six different process vrints S i P (i = 1, 2,... 6) hve been configured. These vrints do not only differ in structure, but lso with respect to their ctivity sets. For exmple, ctivity X ppers in 5 of the 6 vrints (except S 2 ), while Z only ppers in S 5. The 6 vrints re further weighted bsed on the number of process instnces creted from them. In our exmple, 25% of ll instnces were executed ccording to vrint S 1, while 20% rn on S 2. If we only know the process vrints, but hve no runtime informtion bout relted instnce executions, we ssume vrints to be eqully weighted; i.e., every process vrint then hs weight 1/n, where n corresponds to the totl number of vrints. We cn lso compute the distnce (cf. Def. 2) between originl reference model S nd ech vrint S i. For exmple, when compring S with S 1 we obtin distnce 4 (cf. Fig. 3); i.e., we need to pply four high-level chnge opertions (move(s, H,I,D), move(s, I,J, endf low), move(s, J,B, endf low) nd insert(s, X,E,B); cf. Def. 1) to trnsform S into S 1. Bsed on weight w i of ech vrint S i, we cn then compute verge weighted distnce between reference model S nd its vrints. Regrding our exmple, s distnces between S nd S i we obtin: 4(i = 1,..., 6) 6 (cf. Fig. 3). When considering the vrint weights, s verge weighted distnce, we obtin = 4.0. This mens we need to perform on verge 4.0 chnge opertions to configure process vrint (nd relted instnce respectively) out of the reference model. Generlly, verge weighted distnce between reference model nd its process vrints represents how close they re. The gol of our mining lgorithm is to discover reference model for collection of (weighted) process vrints with miniml verge weighted distnce to the vrints. 3.2 Heuristic Serch for Process Vrint Mining As discussed in Section 2, mesuring the distnce between two models is n N P problem, i.e., the time for computing the distnce is exponentil to the size of the process models. Consequently, the problem set out in our reserch question (i.e., finding reference model which hs miniml verge weighted distnce to the vrints), is n N P problem s well. When encountering rel-life cses (i.e., 6 In our exmple, ll vrints hve the sme distnce to the originl reference model. We specilly designed it in this wy in order to better explin our simultion s presented in Section 6. Clerly, our lgorithm cn lso be pplied when vrints hve different distnces to the originl reference model.

7 7 I A E F G C D J B H S : originl reference model Process configurtion Weight: w 1 = 25% E S F 1 C S 2 Distnce: d (S,S1) = 4 Bies: B (S,S1) = Weight: w 3 = 15% Y C S 3 A S 4 E Distnce: d (S,S3) = 4 Bies: B (S,S3)= Weight: w 5 = 20% A E X Y B J F S 5 G S 6 I H G D X F A X D G I J C Z D I H Distnce: d (S,S5) = 4 Distnce: d (S,S6) = 4 Bies: B (S,S5) = move (S, H, I, D), move(s, I, J, endflow), move (S, J, B, endflow), insert(s, X, E, B) move(s, J, {A,F}, B), insert(s, X, E, J), Insert (S, Y, strtflow, I), move(s, I, D, H) insert(s, Y, {A,F}, B), insert(s, X, E, Y), insert(x, Z, C, D), move (S, J, B, endflow) B B H J Weight: w 2 = 20% Distnce: d (S,S2)= 4 Bies: B (S,S2) = E Distnce: d (S,S4) = 4 B (S,S4)= Bies: B (S,S6) = E Weight: w 4 = 10% Bies: C Weight: w 6 = 10% E insert(s, Y, E, B, con), move(s, C, strtflow, I), move (S, J, B, endflow), move (S, I, D, H) A X F H A F Y C D B G G insert(s, X, E, B), insert(s, Y, strtflow, B), move (S, J, B, endflow), move (S, H, I, C) D move(s, H, strtflow, I), move (S, I, B, endflow), insert(s, X, E, B), move (S, J, B, endflow, con) I A X F Y I G I J B B H C D H J J Averge weighted distnce = 4 chnge / instnce AND-Split AND-Join XOR-Split XOR-Join Fig. 3. Illustrting exmple thousnds of vrints with complex structure), finding the optimum would therefore be either too time-consuming or not fesible. In this pper, we present heuristic serch lgorithm for vrint mining. Our overll gol is to find solution which is close to the optimum, but cn be computed in resonble mount of time. Heuristic lgorithms re widely used in vrious fields of computer science, e.g., rtificil intelligence [20], dt mining [36] nd mchine lerning [24]. A problem employs heuristics when it my hve n exct solution, but the computtionl cost of finding it my be prohibitive [20]. Although heuristic lgorithms do not im t finding the rel optimum (i.e., it is neither possible to theoreticlly prove tht the discovered result is the optimum nor cn we sy how

8 8 close it is to the optimum), they re widely used in prctice. Usully heuristic lgorithms provide nice blnce between the goodness of the discovered solution nd the computtion time for finding it [20]. Regrding the mining of process vrints, Fig. 4 illustrtes on how heuristic lgorithms cn be pplied in our context. Here we represent ech process vrint S i s single node in the two dimensionl spce (white node). The gol for vrint mining is then to find the center of these nodes (bull s eye S nc ), which hs miniml verge distnce to them. In ddition, s discussed in Section 1, we lso wnt to tke the originl reference model S (solid node) into ccount, such tht we cn control the difference between the newly discovered reference model nd the originl one. Bsiclly, this requires us to blnce two forces: one is to bring the newly discovered reference model closer to the vrints (i.e., to the bull s eye S nc t right) thn the old one; the other one is to move the discovered model not too fr wy from originl model S (the solid node t left) such tht it does not differ too much from the originl one. Process designer obtin the flexibility to blnce these two forces, i.e., they re ble to discover model (e.g., S c ), which is closer to the vrints thn the old one but which is still within limited distnce to the ltter. Clerly, the chnge opertions pplied first to the (originl) reference model should be more importnt (i.e., reduce the distnce between the reference model nd the vrints more) thn the ones positioned t end. Consequently, if we ignore the less relevnt chnges, we will not influence the overll distnce reduction between reference model nd vrints too much. Our heuristic lgorithm works s follows: 1. We use originl reference model S s strting point. 2. We serch for ll neighboring process models with distnce 1 to the currently considered reference process model. If we re ble to find better model S mong these cndidte models (i.e., one which hs lower verge weighted distnce to the given collection of vrints thn S), we replce S by S. 3. Repet Step 2 until we either cnnot find better model or the mximlly llowed distnce between originl nd new reference model is reched. Finlly, S corresponds to our discovered reference model S c. If we do not set ny serch limittion, our heuristic lgorithm is lso ble to find the center of the vrints (i.e., S nc ). This implies tht it cn be lso pplied to scenrios where there only exists collection of vrints, but the originl reference model is not known. In this cse, we cn rndomly select vrint S i s strting point nd serch unlimitedly until we find the center, i.e., the model with miniml verge weighted distnce to the collection of vrints. Generlly, most importnt for ny heuristic serch lgorithm re two spects: the heuristic mesure nd the lgorithm tht uses heuristics to serch the stte spce. Section 4 introduces the fitness function which mesures the qulity of prticulr cndidte model; i.e., it llows us to pproximtely evlute how close such cndidte model is to the given vrints. Section 5 then introduces best-first serch lgorithm to serch the stte spce; i.e., this lgorithm illustrtes how to serch for next cndidte process model.

9 9 Force 2: close to reference Force 1: close to vrints S i :Vrints d = 2 d =3 d=1 S: Originl reference model No constrint S c: Serch result with constrint S nc : Serch result without constrint Originl Reference model Process vrints Intermedite serch result Discovered Reference Model Serch steps Fig. 4. Heuristic serch pproch 4 Fitness Function of Heuristic Serch Algorithm Generlly, ny fitness function of heuristic serch lgorithm should be quickly computble. Since serch spce my become very lrge, we must be ble to mke quick decision on which pth to choose next. Averge weighted distnce cn not be used s fitness function since complexity for computing it is N P. In this section we introduce fitness function, which cn be used to pproximtely mesure the closeness between cndidte model nd the collection of vrints. In prticulr, it cn be computed in polynomil time. Like in most heuristic serch lgorithms, the chosen fitness function is resonble guessing rther thn precise mesurement. Therefore, in Section 7 we will investigte how the fitness function is correlted with the verge weighted distnce. 4.1 Activity Coverge For cndidte process model S c = (N c, E c,...) P, we first mesure to wht degree its ctivity set N c covers the ctivities tht occur in the considered collection of vrints. We denote this mesure s ctivity coverge AC(S c ) of S c. Before we cn compute ctivity coverge, we first need to determine the frequency with which ech ctivity j ppers within the collection of vrints. Definition 4 (Activity frequency). Let S i = (N i, E i,...) P, i = 1, 2,..., n be collection of vrints with weights w i nd ctivity sets N i. For ech j n i=1 N i, we define g( j ) s reltive frequency with which j ppers within the given vrint collection. Formlly: g( j ) = w i (1) S i : j N i Tble 1 shows the frequency of ech ctivities contined in ny of the vrints in our running exmple; e.g., X is present in 80% of the vrints (i.e., in S 1, S 3, S 4, S 5, nd S 6 ), while Z only occurs in 20% of the cses (i.e., in S 5 ).

10 10 Activity A B C D E F G H I J X Y Z g( j) Tble 1. Frequency of ech ctivity within the given vrint collection Definition 5 (Activity coverge). Let M = n i=1 N i be the set of ctivities which re present in t lest one of the vrints. Let further N c be the ctivity set of cndidte process model S c. Given ctivity frequency g( j ) of ech j M, we cn compute ctivity coverge AC(S c ) of model S c s follows: AC(S c ) = j N c g( j ) j M g( (2) j) The vlue rnge of AC(S c ) is [0, 1]. Let us tke originl reference model S s cndidte model. It contins ctivities A, B, C, D, E, F, G, H, I, nd J, but does not contin X, Y nd Z. Therefore, its ctivity coverge AC(S), which represents how much it covers the ctivities in the vrint collection, is Structure Fitting Though AC(S c ) mesures how representtive the ctivity set N c of cndidte model S c is with respect to given vrint collection, it does not sy nything bout the structure of the cndidte model. We therefore introduce structure fitting SF (S c ) s nother mesurement. It mesures to wht degree cndidte model S c structurlly fits to the given collection of vrints S i. We first sketch method which llows to represent process model S s order mtrix. Bsed on this, we introduce ggregted order mtrix which llows to represent collection of process vrints s mtrix. In ddition, we introduce the coexistence mtrix which shows the importnce of the order reltions. Finlly, we describe how to mesure structure fitting SF (S c ) of cndidte model S c. Representing Process Models s Order Mtrices One key feture of our ADEPT chnge frmework is to mintin the structure of the unchnged prts of process model [25]. For exmple, when deleting n ctivity this neither influences successors nor predecessors of this ctivity, nd therefore lso not their order reltions. To incorporte this feture in our pproch, rther thn only looking t direct predecessor-successor reltionships between ctivities (i.e. control edges), we consider the trnsitive control dependencies for ech pir of ctivities; i.e. for given process model S = (N, E,...) P, we exmine for ctivities i, j N, i j their trnsitive order reltion. Logiclly, we determine order reltions by considering ll trces the process model my produce (cf. Section 2). Results re ggregted in n order mtrix A N N, which considers four types of control reltions (cf. Def. 6): Definition 6 (Order mtrix). Let S = (N, E,...) P be process model with N = { 1, 2,..., n }. Let further T S denote the set of ll trces producible

11 11 on S. Then: Mtrix A N N is clled order mtrix of S with A ij representing the order reltion between ctivities i, j N, i j iff: A ij = 1 iff ( t T S with i, j t t( i j )) If for ll trces contining ctivities i nd j, i lwys ppers BEFORE j, we denote A ij s 1, i.e., i lwys precedes of j in the flow of control. A ij = 0 iff ( t T S with i, j t t( j i )) If for ll trces contining ctivities i nd j, i lwys ppers AFTER j, we denote A ij s 0, i.e. i lwys succeeds of j in the flow of control. A ij = * iff ( t 1 T S, with i, j t 1 t 1 ( i j )) ( t 2 T S, with i, j t 2 t 2 ( j i )) If there exists t lest one trce in which i ppers before j nd nother trce in which i ppers fter j, we denote A ij s *, i.e. i nd j re contined in different prllel brnches. A ij = - iff ( t T S : i t j t) If there is no trce contining both ctivity i nd j, we denote A ij s -, i.e. i nd j re contined in different brnches of conditionl brnching. Fig. 5 gives n exmple. Besides control edges, which express direct predecessorsuccessor reltionships, the depicted process model S contins different control connectors: AND-Split, AND-Join, XOR-Split nd XOR-join. The depicted order mtrix represents ll these reltions. For exmple, ctivities A nd B will never pper in the sme trce since they re contined in different brnches of n XOR block. Therefore, we ssign - to mtrix element A AB. Similrly, we obtin the reltion for ech pir of ctivities. The min digonl of the mtrix is empty since we do not compre n ctivity with itself. Under certin conditions, n order mtrix uniquely represents the process model it ws creted from [17]. Anlyzing its order mtrix (cf. Def. 6) is then sufficient for nlyzing the corresponding process model. B D A C E G F 0 : successor 1 : predecessor * : AND-block - : XOR-block AND-Split XOR-Split AND-Join XOR-Join ) Process model S b) Order mtrix of S Fig. 5. ) Process model nd b) relted order mtrix Aggregted Order Mtrix For given collection of process vrints, first, we compute the order mtrix of ech process vrint (cf. Def. 6). Regrding our

12 12 running exmple from Fig. 3, we need to compute six order mtrices (cf. Fig. 6). Note tht we only show prtil view of the order mtrices here, (ctivities H, I, J, X, Y nd Z) due to spce limittions. Afterwrds, we nlyze the order reltion for ech pir of ctivities considering ll order mtrices derived before. As the order reltion between two ctivities might be not the sme in ll order mtrices, this nlysis does not result in fixed reltion, but provides distribution for the four types of order reltions (cf. Def. 6). Regrding our exmple, in 65% of ll cses, H is successor of I (s in S 2, S 3, S 4 nd S 6 ), in 25% of ll cses H is predecessor of I (s in S 1 ), nd in 10% of ll cses H nd I re contined in different brnches of n XOR block (s in S 4 ) (cf. Fig. 6). Generlly, to collection of process vrints we cn define the order reltion between ctivities nd b s 4-dimensionl vector V b = (vb 0, v1 b, v b, v b ). Ech field then corresponds to the frequency of the respective reltion type ( 0, 1, * or - ) s specified in Def. 6. Tke our running exmple nd consider Fig. 6. Here, vhi 1 corresponds to the frequency of ll cses with ctivities H nd I hving order reltionship 1, i.e., ll cses for which H precedes I; we obtin V HI = (0.65, 0.25, 0, 0.1). Formlly, we define n ggregted order mtrix s follows: Definition 7 (Aggregted Order Mtrix). Let S i = (N i, E i,...) P, i = 1, 2,..., n be collection of process vrints with ctivity sets N i. Let further A i be the order mtrix of S i, nd let w i represent the number of process instnces being executed bsed on S i. The Aggregted Order Mtrix of ll process vrints is defined s 2-dimensionl mtrix V m m with m = N i nd ech mtrix element v jk = (vjk 0, v1 jk, v jk, v jk ) being 4-dimensionl vector. For τ {0, 1,, }, element vjk τ expresses to wht percentge, ctivities j nd k hve order reltion τ within the collection of process vrints S i. Formlly: j, k N i, j k : vjk τ = A ijk = τ w i j, k N w i. i Fig. 6 shows the ggregted order mtrix V for the process vrints from Fig. 3. Agin, due to spce limittions, we only consider order reltions for ctivities H, I, J, X, Y, nd Z. Generlly, in n ggregted order mtrix, the min digonl is lwys empty since we do not specify the order reltion of n ctivity with itself. For ll other elements, non-filled vlue in certin dimension mens it corresponds to zero. Importnce of the Order Reltions Generlly, the order reltions computed by n ggregted order mtrix my be not eqully importnt. For exmple, reltionship V HI between H nd I (cf. Fig. 6) would be more importnt thn reltion V HZ, since ctivities H nd I pper together in ll six process vrints while ctivities H nd Z only show up together in vrint S 5 (cf. Fig. 3). We therefore define co-existence mtrix CE in order to represent the importnce of the different order reltions occurring within n ggregted order mtrix V. Definition 8 (Coexistence Mtrix).

13 VH S1 13 Order mtrices :25% S2 :20% S3 :15% S4 :10% S5 :20% S6 :10% V 0 : 65% 1 : 25% * : 0% - : 10% I= (0.65, 0.25, 0, 0.10) Aggregted order mtrix 0 1 * - 0 : successor 1 : predecessor * : AND-block - : XOR-block Fig. 6. Aggregted order mtrix bsed on process vrints Let S i = (N i, E i,...) P, i = 1, 2,..., n be collection of process vrints with ctivity sets N i. Let further w i represent reltive frequency of process instnces being executed bsed on S i. The Coexistence Mtrix of process vrints S 1,..., S n is then defined s 2-dimensionl mtrix CE m m with m = N i. Ech mtrix element CE jk corresponds to the reltive frequency with which ctivities j nd k pper together within the given collection of vrints. Formlly: j, k N i, j k : CE jk = S i : j N i k N i w i. Regrding our running exmple, Tble 7 shows the corresponding coexistence mtrix. Agin, due to spce limittions, we only depict the coexistence mtrix for ctivities H, I, J, X, Y, nd Z. For instnce, CE HI = 1 nd CE HZ = 0.2. This indictes tht order reltion between H nd I is more importnt thn the one between H nd Z. Structure Fitness of Cndidte Process Models Since we cn represent cndidte process model S c by its corresponding order mtrix A c (cf. Def. 6), we determine the structure fitting SF (S c ) between S c nd the vrints (cf. subsection 4.2) by mesuring how similr order mtrix A c nd ggregted order mtrix V (representing the vrints) re. We cn compute S c by mesuring the order reltions between every pir of ctivities in A c nd in V.

14 14 Fig. 7. Pirwise co-existnce of ctivities For exmple, consider reference model S s cndidte process model S c (i.e., S c = S). A prtil view of the corresponding order mtrix A is depicted in Fig. 8. Obviously, A HI = 0 holds, i.e., H is successor of I in model S (cf. Fig. 8). Consider now ggregted order mtrix V. Here the order reltion between ctivities H nd I is represented by the 4-dimensionl vector V HI = (0.65, 0.25, 0, 0.1). If we now wnt to compre how close A HI nd V HI re, we first need to build n ggregted order mtrix V c purely bsed on our cndidte process model S c (S in our cse). Fig. 8 shows both the order mtrix A c nd the clculted ggregted order mtrix V c of process model S c (S c = S). As order reltion between H nd I in V c, we obtin V HI c = (1, 0, 0, 0), i.e., H is lwys successor of I. We now cn compre V HI (which represents the vrints) with V HI c (which represents the reference model). We use Eucliden metrics f(α, β) to mesure the closeness between two vectors α = (x 1, x 2,..., x n ) nd β = (y 1, y 2,..., y n ): f(α, β) = α β α β = n i=1 x iy i n i=1 x2 i n i=1 y2 i (3) f(α, β) [0, 1] computes the cosine vlue of the ngle θ between vectors α nd β in Eucliden spce. If f(α, β) = 1 holds, α nd β exctly mtch in their directions; f(α, β) = 0 mens, they do not mtch t ll. Regrding our running exmple, we obtin f(v HI, V HI c ) = This number indictes high similrity between the order reltions of the cndidte process model with respect to H nd I nd the ones cptured by the vrints. Bsed on (3) which mesures similrity between the order reltions, nd Coexistence mtrix CE (cf. Def. 8) which mesures importnce of the order reltions, we cn define structure fitness SF (S c ) of cndidte model S c s follows: Definition 9 (Structure Fitness). Let S i = (N i, E i,...) P, i = 1, 2,..., n be collection of process vrints with ctivity sets N i. Let further CE be the

15 15 ) b) A c : order mtrix of cndidte model S c s originl reference model S V c : Aggregted order mtrix by cndidte model S c s originl reference model S Fig. 8. Order mtrix A c nd ggregted order mtrix V c constructed by cndidte model S c = S coexistence mtrix, nd V be the ggregted order mtrix of the collection of vrints. For cndidte model S c, let m = N c corresponds to the number of ctivities in S c ; let further V c be ggregted order mtrix of S c. we cn compute structure fitness SF (S c ) s follows: SF (S c ) = m j=1 m k=1,k j (f(v j k, V c j k ) CE j k ) m (m 1) (4) For every pir of ctivities j, k N c, j k, we first compute similrity of corresponding order reltions (s cptured by V nd V c ) by mens of f(v j k, V c j k ), nd second the importnce of these order reltions by CE j k. Structure fitness SF (S c ) [0, 1] of cndidte model S c then equls the verge of the similrity multiplied by the importnce of every order reltion. Regrding our exmple from Fig. 3, structure fitting SF (S) of the originl reference model S corresponds to Fitness Function So fr, we hve described the two mesurements ctivity coverge AC(S c ) nd structure fitting SF (S c ) to evlute fitness of cndidte model S c. While AC(S c ) mesures to wht degree ctivities occurring in the vrints re covered by the cndidte model S c, SF (S c ) mesures to wht degree S c fits structurlly to the vrints. Definition 10 (Fitness). For cndidte model S c, let AC(S c ) be the ctivity cover of S c nd let further SF (S c ) be the structure fitness of S c. We compute

16 16 the fitness F it(s c ) of cndidte model S c s follows: F it(s c ) = AC(S c ) SF (S c ) (5) As AC(S c ) [0, 1] nd SF (S c ) [0, 1], vlue rnge of F it(s c ) is [0,1] s well. Fitness vlue F it(s c ) indictes how close cndidte model S c is to the given collection of vrints. If F it(s c ) = 1 holds, cndidte model S c will fit perfectly to the vrints; i.e., no dditionlly dpttion will be needed. Otherwise, further dpttions might be required. The higher F it(s c ) is, the closer S c will be to the vrints nd the less configurtion efforts will be needed. Regrding our exmple from Fig. 3, fitness vlue F it(s) of the originl reference process model S is F it(s) = AC(S) SF (S) = = As fitness of cndidte model S c is evluted by ctivity coverge AC(S c ) multiplied by structure fitting SF (S c ), we cn utomticlly blnce the number of ctivities to be considered in cndidte model S c. If too mny ctivities of low relevnce (i.e., ctivities which only pper in limited number of instnces; e.g., ctivity Z in our exmple) re considered in the cndidte model, we will obtin high AC(S c ) vlue. However, SF (S c ) then possibly will decrese since coexistence vlues (cf. Def. 8) of such less relevnt ctivities re often very low (cf. Fig. 7). On the contrry, if S c contins too few ctivities, SF (S c ) cn potentilly be very high, while AC(S c ) will be too low in order to qulify S c s good cndidte model. Therefore, high vlue for F it(s c ) does not only men tht S c structurlly fits well to the vrints, but lso tht resonble number of ctivities is considered in the cndidte model. The complexity of computing F it(s c ) is polynomil. To be more precise, let n be the number of vrints nd let m = n i=1 S i be the totl number of ctivities in the vrints. The complexity to compute ctivity frequency (cf. Def. 4) is O(mn) nd the complexity to compute ggregted order mtrix V (cf. Def. 7) is O(2m 2 n). Bsed on this, the complexity to compute the fitness function is O(m + 2m 2 ). Note tht it is significntly lower thn N P level complexity s needed for computing the verge weighted distnce. As lredy discussed, the fitness function F it(s c ) is only resonble guess rther thn n exct mesurement (like verge weighted distnce). Therefore, we nlyze the performnce of our fitness function lter in Section 7. 5 Constructing the Serch Tree We hve sketched the bsic steps of our heuristic mining lgorithm in Section 3.2. In Section 4, we hve then discussed how to evlute cndidte process model S c bsed on fitness function F it(s c ). In this section, we show how we cn find dequte cndidte process models. For tht purpose we present best-first lgorithm which llows us to construct serch tree in such wy tht we cn find the best cndidte model in the serch spce. Section 5.1 first provides n overview on how we construct the serch tree by compring the result models from chnging ech ctivity j N c in the cndidte model S c. In Section 5.2, we further describe in wht wys we cn dpt prticulr ctivity j in order

17 17 to find ll kids of given cndidte process model S c, i.e., ll models which hve distnce one to S c. In Section 5.3, we provide serch results nd n evlution of our exmple models from Fig. 3. Finlly, Section 5.4 provides prototype implementtion of the described lgorithm. 5.1 The Serch Tree Let us revisit Fig. 4 which gives generl overview of our heuristic serch pproch. Strting with the current cndidte model S c, in ech itertion, we serch for its neighbors (i.e., process models which hve distnce 1 to S c ) to see whether we still cn find better cndidte model S c with higher fitness vlue. Generlly, we cn construct neighbor model for given process model S c = (N c, E c,...) by pplying one insert, delete, or move opertion to S c. All ctivities j N i (N i corresponds to the ctivity set of vrint S i ), which hve ppered in the vrint collection, re cndidte ctivities for chnge. Obviously, n insert opertion dds n ctivity j / N c to S c, while the other two opertions delete or move n ctivity j lredy present in S c (i.e., j N c ). Generlly, numerous process models my result by chnging one prticulr ctivity j on S c. Note tht the positions where we cn insert ( j / N c ) or move ( j N c ) ctivity j cn be numerous. Section 5.2 provides detils on how to find ll process models resulting from the chnge of one prticulr ctivity j on S c. In this section, first of ll, we ssume tht we hve lredy found the best process model (i.e., with highest fitness vlue) from ll the models resulting from chnging prticulr ctivity j on S c. We denote this model s the best kid S j kid of S c when chnging j. Our bsic ide is to crete ll neighbor models, to evlute ech of them with the fitness function, nd to finlly choose the one with highest fitness vlue. We present best-first lgorithm to perform our heuristic vrint mining (cf. Algorithm 1). To illustrte this lgorithm, we use the serch tree depicted in Fig. 9. Our serch lgorithm strts with setting the originl reference model S s the initil stte, i.e., S c = S (see the node t the top of Fig. 9). We further define AS s ctive ctivity set, which contins ll ctivities vilble for chnge. At the beginning, AS = { j j n i=1 N i} contins ll ctivities tht pper in t lest one process vrint S i. For ech ctivity j AS, we determine the corresponding best kid S j kid of S c when chnging j on S c (i.e., when deleting, moving or inserting j ). If the best kid S j kid hs higher fitness vlue thn S c, we mrk it with lines; otherwise, we mrk it white nd remove j from AS (cf. Fig. 9). 7 Afterwrds, we find the best one mong ll the best kids S j kid, i.e., the 7 For ll nodes mrked s white, we remove them from ctive ctivity set AS. Consequently, we stop serching the best kid when chnging them. In principle, it is still possible tht when chnging them lter (i.e., bsed on nother cndidte model S c), we cn find better resulting model. However, such chnce is very low due to the fct tht we hve lredy enumerted ll possible solutions by chnging such ctivity on S c. We therefore remove them from AS in order to reduce the serch spce.

18 A B A B C 18 Originl reference model Z Y Best kid when chnging A Best kid when chnging B Z Best kid when chnging Z Best kid when chnging Y Serch result Best sibling of ll best kids Best kid is better thn prent Best kid is NOT better thn prent Strt / End B Terminting condition: No kid is better thn its prent Fig. 9. Construct the serch tree one with highest fitness vlue. We denote this model s best sibling S sib nd we mrk the corresponding ctivity s ccordingly. Since this model S sib is the best model we re ble to obtin by one chnge opertion on cndidte model S c, we set S sib s the first intermedite serch result nd replce S c by S sib for further serch (cf. Fig. 9, S sib re mrked s bull s eyes). Note tht we lso remove s from AS since this ctivity hs now been lredy considered for chnge. The described serch method goes on itertively, until termintion condition is met, i.e., we either cn not find better model, or the llowed serch distnce is reched. The finl serch result S sib corresponds to our discovered reference model S (the node mrked by bull s eye nd circle in Fig. 9). 5.2 Options for Chnging One Prticulr Activity Section 5.1 hs shown how to construct serch tree by compring the best kids S j kid. This section discusses how to find such best kid Sj kid when chnging prticulr ctivity j, i.e., we discuss how to find the neighbors of cndidte model S c by performing one high-level chnge opertion (cf. Def. 1) on j. The best kid S j kid is consequently the one with highest fitness vlue mong ll considered models. Regrding prticulr ctivity j, we consider three type of bsic chnge opertions: delete, move nd insert ctivity (cf. Section 5.1). The neighbor model resulting through deletion of n ctivity j N c cn be esily determined by removing j from the process model nd the corresponding order mtrix; furthermore, movement of n ctivity cn be simulted by its deletion nd subsequent re-insertion t the desired position. Thus, the bsic chllenge in finding neighbors of cndidte model is to pply one ctivity insertion such tht block

19 19 input : A process model S; collection of process vrints S i = (N i, E i,...), i = 1,..., n; llowed serch distnce d ; output: Resulting process model S AS = n 1 i=1 N i /* Define AS s ctive ctivity set */; 2 S c = S /* Define initil cndidte model */; 3 t = 1 /* Define initil serch step */ ; 4 while AS > 0 nd t d /* Serch condition */; 5 do 6 S sib = S c /* Set S c s initil S sib */ ; 7 Define s s the selected ctivity ; 8 forech j AS do 9 S kid = FindBestKid(S c) ; 10 if Fitness(S kid ) > Fitness(S c ) then 11 if Fitness(S kid ) > Fitness(S sib ) then 12 S sib = S kid ; 13 s = j ; 14 end 15 else 16 AS = AS \ { j } ; /* Best kid not better thn its prent */ 17 end 18 end 19 if Fitness(S sib ) > Fitness(S c) then 20 S c = S sib ; /* Initite next itertion */ ; 21 AS = AS \ { s} ; 22 else 23 brek ; 24 end 25 t = t+1 ; 26 end Algorithm 1: Heuristic serch lgorithm for vrint mining structuring nd soundness of the resulting model cn be gurnteed. Obviously, for prticulr ctivity j, the positions where we cn (correctly) insert it into cndidte model S c re the subjects of our interest. Inserting j t (correct) position within S c results in one neighbor model. Therefore, finding ll neighbors first requires finding ll vlid positions where we cn correctly insert j in S c. Fig. 10 provides one exmple. Given process model S, we would like to find ll process models tht my result when inserting ctivity X into S. We pply the following two steps to simulte the insertion of n ctivity: 1. First, we enumerte ll possible blocks the cndidte model S contins. A block cn be n tomic ctivity, self-contined prt of the process model, or the process model itself (cf Algorithm 2 for n lgorithm enumerting ll possible blocks of process model). Note tht the number of possible

20 20 cndidte blocks cn become very lrge; e.g., hundreds of potentil blocks my exist for process model contining 50 ctivities. 2. After hving determined ll blocks of the current model we now cn simulte ll possible insertions of ctivity X. For this purpose, we cn cluster X with ech block nd position it in reltion to this block, i.e., we cn set order reltion τ {0, 1,, } between X nd the selected block B (cf. Def. 6). This wy, we insert X to the respective position such tht it forms nother block together with B; or in nother word, we replce block B by nother block B which contins B nd X. Consequently, we obtin neighbor model S by inserting X into S. Following these two steps, we cn gurntee tht the resulting process model is sound nd block-structured. Every time we cluster n ctivity with block, we ctully dd this ctivity to the position where it cn form bigger block together with the selected one, i.e., we replce self-contined block of process model by bigger one. Consider our exmple from Fig. 10. Among the determined blocks, we cn find the sequentil block defined by ctivities C nd D (step 1). Then we cn cluster ctivity X with this block using order reltion τ = 0 for exmple (step 2). Consequently, we obtin S s one neighbor of S (cf. Fig. 10). In the following, we describe these two steps in detil. I G C D J H Enumerte blocks Blocks contining n ctivities n = 1 n = 2 Blocks {I}, {G}, {C}, {D}, {J}, {H} {C, D}, {J, H} Cluster X with block {C, D} by τ= 0 I C G D X J H S: process model n = 3 n = 4 {C, D, G} {I, C, D, G}, {C, D, G, H} S : one possible resulting model fter inserting ctivity X in S C D G H I J C 1 * D 0 * G * * H I J A S: Order mtrix of S Sme order reltions n = 5 n = 6 {I, C, D, G, J}, {C, D, G, J, H} {I, C, D, G, J, H} τ= 0 C D G H I J C D G H I J 0 X * * * X 1 1 * * * Copy of block {C,D} A S : Order mtrix of S Step 1: enumerte blocks ) b) Step 2: clustering X I C G D J H G I X J H C D I G X C D H J 56 potentil neighbors Cluster X with block {I, C, D, G, J, H} by τ= 1 Cluster X with block {G} by τ= * Cluster X with block {J, H} by τ= - c) Some exmple neighboring models by inserting X into S Fig. 10. Finding the neighboring models by inserting X into process model S

21 21 Step 1: Block-enumerting Algorithm We now present n lgorithm to enumerte ll possible blocks of process model S. Let S = (N, E,...) P be process model with N = { 1,..., n }. Let further A N N be the order mtrix of S. Two ctivities i nd j cn form block if nd only if [ k N \ { i, j } : A ik = A jk ] holds; i.e., two ctivities cn form block if nd only if they hve exctly sme order reltions to the remining ctivities. Consider our exmple from Fig. 10. Here ctivities C nd D cn form block since they hve sme order reltions to the remining ctivities G, H, I nd J. input : A process model S = (N, E,...) nd its order mtrix A output: A set BS with ll possible blocks 1 Define BS x be set of blocks contining blocks with x ctivities. x = (1,..., n); 2 Define ech ctivity i s block B, i = (1,..., n) ; 3 BS 1 = {B 1,..., B n }. /* initil stte */ ; 4 for i = 2 to n /* Compute BS i */; 5 do 6 let j = 1; let k = i; 7 while j k do 8 k = i - j /* A block contining k ctivities cn only be obtined by merging blocks contining i nd j ctivities */; 9 forech (B j, B k ) BS j BS k ; 10 merge = TRUE /* judge whether B j nd B k cn form block */; 11 do 12 if B j Bk = /* Disjoint? */ then 13 forech ( α, β, γ ) B j B k (N \ B j Bk ) do 14 if A αγ A βγ then 15 merge = FALSE /* two blocks con merge only if they show sme order reltions to the ctivities out side the two blocks */; 16 brek ; 17 end 18 end 19 else 20 merge = FALSE; 21 end 22 if merge = TRUE then 23 B p = B j Bk ; 24 BS i = BS i Bp ; 25 end 26 end 27 j = j + 1 ; 28 end 29 end BS = n x=1 BSx Algorithm 2: Block enumerting lgorithm

22 22 The block-enumerting lgorithm is depicted in Algorithm 2. Let us first define BS x s the set contining ll blocks comprising exctly x ctivities. In its initil stte, ech ctivity forms single block by its own (line 2) nd consequently we obtin BS 1 (line 3). The lgorithm strts by computing BS 2 (blocks contining 2 ctivities) nd continues itertively to compute BS i until it reches its upper boundry i = n. In ech itertion, we cn determine block contining x ctivities by merging two disjoint blocks contining j nd k ctivities respectively (i = j +k) (line 8). For exmple, block contining 2 ctivities cn only be obtined by merging two blocks of which ech contins 1 ctivity. Or we cn only obtin block contining 5 ctivities by merging two disjoint blocks contining either 1 nd 4 ctivities respectively or 2 nd 3 ctivities respectively (line 4-29). Lines 9 to 26 check whether or not two blocks B j nd B k cn be merged together. This is possible iff ny ctivities α B j nd β B k show sme order reltions to the remining ctivities outside the two blocks. Otherwise (lines 14-17), B j nd B k cnnot form block (i.e., merge = F ALSE). Until we obtin ll sets of blocks BS x with x = 1,..., n ctivities per block, we cn define set BS s BS = n x=1 BS x. BS consequently corresponds to ll blocks, process model S contins (line 28). In our context, we consider ech block s set rther thn process model, since structure of these blocks is lredy cler in S. 8 Consider the exmple from Fig. 10. For the given process model S, ll possible blocks re enumerted. For exmple, s ctivities C nd D show sme order reltions in respect to the remining ctivities in order mtrix A s, they my form block. Or, block {C, D} nd block {G} show sme order reltions in respect to remining ctivities H, I nd J; therefore they cn form bigger block {C, D, J}. As S contins 6 ctivities, its blocks re orgnized in 6 groups contining blocks of different sizes. Step 2: Cluster Inserted Activity with One Block In Step 1, we hve shown how to enumerte ll possible blocks for given cndidte model S c. Bsed on this, we describe where we cn insert prticulr ctivity j in S c such tht we obtin sound nd block-structured model gin. Assume tht we wnt to insert ctivity X in S (cf. Fig. 10). To ensure the block structure of the resulting model, we cluster X with n enumerted block, i.e., we replce one of the previously determined blocks B by bigger block B contining B s well s X. In the context of this clustering, we set the order reltion between block B nd inserted ctivity X s τ {0, 1,, } (see Def. 6), i.e., the order reltions between X nd ll ctivities of B re defined by τ. One exmple is given in Fig. 10b, where the inserted ctivity X is clustered with block {C, D} by order reltion τ = 0, i.e., we set X s successor of the 8 Worst-cse, the complexity of this lgorithm is 2 n where n corresponds to the number of ctivities. However, this worst-cse scenrio will only occur if ny combintion of ctivities my form block (like process model for which ll ctivities re ordered in prllel to ech other). During our simultion, in most cses we were ble to enumerte ll blocks of process model within few milliseconds. This indictes tht complexity is low in prctice.

23 23 sequence block contining C nd D. To relize this clustering, we hve to set the order reltions between X on the one hnd nd ctivities C nd D from the selected block on the other hnd to 0. Furthermore, order reltions between X nd the remining ctivities re the sme s for C nd D respectively. Afterwrds these three ctivities form new block {C, D, X} replcing the old one (i.e., {C, D}). This wy, fter inserting X into S, we obtin new sound nd block-structured process model S. Fig. 10b shows only one resulting model S which we obtin when inserting X in S. Obviously, S is not the only neighboring models in this context since we cn insert X t different positions in S; i.e., for ech block S enumerted in Step 1, we cn cluster it with X by ny one of the four order reltions τ {0, 1,, }. Regrding our exmple from Fig. 10, S contins 14 blocks. Consequently, the number of models tht my result when inserting X in S equls 14 4 = 56; i.e., we obtin 56 potentil models by inserting X into S. Fig. 10c shows some neighboring models of S. Note tht the 56 resulting models re not necessrily unique, i.e., it is possible tht some of them re sme. However, this is not n importnt issue in our context since our fitness function F it(s c ) cn be quickly computed. Therefore, some redundnt informtion will not significntly decrese the performnce of our heuristic serch lgorithm. 5.3 Serch Result for Our Running Exmple Regrding our exmple from Fig. 3, we now present the serch result we obtin when pplying our heuristics serch lgorithm. Fig. 11 does not only show the finl resulting model, but ll the intermedite process models discovered during the serch. Note tht in this scenrio, we do not set ny limittion on the number of serch steps, i.e., we llow the lgorithm to go s fr s possible to find the best reference model. Fig. 11 shows the evolution of the originl reference model S. The first opertion δ 1 = move(s, J,B, endf low) chnges S into intermedite result model R 1. According to Algorithm 1, R 1 constitutes tht neighbor model of S which cn be derived by pplying one vlid chnge opertion to S nd which shows highest fitness vlue in comprison to ll other neighbor models of S. Using R 1 s next input for our lgorithm, we discover process model R 2. In this context, chnge opertion δ 2 = insert(r 1, X, E, B) is pplied. Finlly, we obtin R 3 by performing chnge δ 3 = move(r 2, I,D,H) on model R 2. Since we cnnot find better process model by chnging R 3 nymore, we obtin R 3 s finl result. Note tht if we set constrints on llowed serch steps (i.e., we only llow to chnge originl reference model by mximl d chnge opertions), the finl serch result would be s follows: R d if d 3 or R 3 if d > 3. We further compre the originl reference model S nd ll (intermedite) serch results in Tble 2. We first show the fitness vlue of ll the models in Fig. 11. As our heuristic serch lgorithm is bsed on finding process models with better fitness vlues, we cn observe improvements of the fitness vlues with ech serch step. The fitness vlue F it(s) increses from (model S) to (model R 1 ), nd

24 1 24 = M o v e ( S, J, B, e n d F lo w ) S[ 1>R1 I E A E F B G J C D S: reference model A B J F G I C D H H E E I A X F C A X F C G D G B B D I J J H R3 : result fter three chnges (Finl result) H >R3 [ 3 R2 ) H, D I,, 2 ( R e v p M = 3 R1 : result fter one chnge R 1 [ 2 >R 2 2=Insert (R1, X, E, B) R2 : result fter two chnges Fig. 11. Serch result by every chnge opertions then to (model R 2 ). Finlly, it reches (model R 3 ). Though such fitness vlue is only resonble guessing of how good the result model is, the improvement of the fitness vlue t lest indictes tht discovered models is ssumed to get better in ech itertion. Still, we need to exmine whether or not the discovered process models re indeed getting better. We therefore compute the verge weighted distnce between the discovered model nd the vrints, which is precise mesurement in our context. From Tble 2, the improvement of verge weighted distnces fter pplying the bove chnges becomes cler, i.e., the verge weighted distnce drops monotoniclly from 4 (when considering model S) to 2.25 (when considering model R 3 ). Mesuring the verge weighted distnce shows tht for the given exmple, the lgorithm performs s expected. One importnt reson to design heuristic serch lgorithm in our context ws to be ble to only consider the most relevnt chnge opertions, i.e., the importnt chnges (reducing verge weighted distnce between reference model nd vrints most) should be discovered t beginning while the trivil ones should be either ignored or be put t the end (cf. Section 1). We therefore dditionlly evlute delt-fitness nd delt-distnce, which indicte the reltive S R 1 R 2 R 3 Fitness Averge weighted distnce Delt fitness Delt Distnce Tble 2. Serch result by every chnge

25 25 improvement of fitness vlues nd the reduction of verge weighted distnce for every itertion of the lgorithm. For exmple, the first chnge opertion δ 1 chnges S into R 1, nd consequently improves fitness vlue (delt-fitness) by nd reduces verge weighted distnce (delt-distnce) by 0.8. Similrly, δ 2 reduces verge weighted distnce by 0.6 nd δ 3 by It is obvious tht the delt-distnce is monotoniclly decresing s the number of chnge opertions increses. This indictes tht the importnt chnges re performed t beginning of the serch, while the less importnt ones re performed t the end. Another importnt feture of our heuristic serch is its bility to utomticlly decide on which ctivities shll be included in the reference model. A predefined threshold or filtering of the less relevnt ctivities in the ctivity set re not needed. In our exmple, X is utomticlly inserted, when Y nd Z re not. The only concern in our heuristic vrint mining is to reduce the verge weighted distnce, i.e., the three chnge opertions (insert, move, delete) re utomticlly blnced bsed on their influence on the reduction of verge weighted distnce. This is lso significnt improvement when compred to mny other process mining techniques in which preprocessing of trivil ctivities should be conducted before performing the mining [15, 38]. 5.4 Proof-of-Concept Prototype The described pproch hs been implemented nd tested using Jv. Figure 12 depicts screenshot of our prototype. We hve used our ADEPT2 Process Templte Editor [27] s tool for creting process vrints. For ech process model, the editor cn generte n XML representtion with ll relevnt informtion (like nodes, edges, blocks) being mrked up. We store creted vrints in vrints repository (cf. Fig. 12) which cn be ccessed by our mining procedure. Fig. 12. Screenshot of the prototype

26 26 The mining lgorithm hs been developed s stnd-lone Jv progrm, independent from the process editor. It cn red the originl reference model nd ll process vrints, nd it cn generte the result models ccording to the XML schem of the process editor. All intermedite serch results re lso stored nd cn be visulized using the ADEPT2 editor. ADEPT2 is next genertion dptive process mngement tool, which llows for the flexible execution of process instnces. Prticulrly, the ADEPT2 frmework enbles d-hoc chnges of single process instnces during runtime s well s chnges t the process type level nd their propgtion to running instnces if desired nd possible [26]. Bsed on the presented mining lgorithm, ADEPT is ble to provide full process lifecycle support. 6 Simultion Setup Clerly, using one exmple to mesure the performnce of our heuristic mining lgorithm is fr from being enough. Since computing the verge weighted distnce is t N P level, the fitness function, whose clcultion only needs polynomil time, is only n pproximtion of verge weighted distnce. Therefore, these two mesures cn NOT be correlted perfectly, i.e., it is not for sure tht the improvement of the fitness vlue will lwys result in deduction of verge weighted distnce. Therefore, the first question is how the fitness improvement (delt-fitness) correltes with the reduction of verge weighted distnce (deltdistnce)? Moreover, we wnt to nlyze whether the lgorithm cn scle up. Clerly, it will tke longer time to find the result if we hve to cope with lrge collection of vrints with dozens or even hundreds of ctivities, simply becuse the serch spce would be significntly lrger. Wht is more importnt to know is whether the performnce will lso chnge when fcing lrge models, i.e., does the correltion between delt-fitness nd delt-distnce depend on the size of the model? In ddition, we re interested in whether it is relly true tht the most importnt chnge opertions (i.e., the chnge opertions which lrgely reduce verge weighted distnce) re performed t beginning of the serch. If this is the cse, we do not need to fer too much when setting serch limittions or filtering out the chnge opertions performed t the end. Therefore, the third reserch question is s follows: To wht degree re the importnt chnge opertions positioned t the beginning of the serch steps? I.e., to wht degree re delt-fitness nd delt-distnce monotoniclly decresing s the number of chnge opertions increses? Finlly, we investigte whether we cn further improve the performnce of our heuristic vrint mining lgorithm using some other dt mining or rtificil intelligence techniques, i.e., we try to dopt the concept of pruning s commonly used in dt mining [36] nd rtificil intelligence [20]. In our context of vrint mining, we cn prune out the sitution when delt-fitness is not nicely correlted with delt-distnce, nd consequently improve the perfor-

27 27 mnce of our lgorithm by dpting our lgorithm to such sitution. Therefore, the lst reserch question is s follows: : How cn we improve the performnce of our heuristic mining lgorithm by dopting the concept of pruning? We try to nswer ll these questions using simultion, i.e., by generting thousnds of dt smples, we cn provide sttisticl nswer for these questions. To summrize, we use simultion to nswer the following reserch questions: 1. How much is delt-fitness correlted to delt-distnce in our heuristic lgorithm? 2. Cn the performnce of the lgorithm scle up? Tht is, does the correltion between fitness vlue nd verge weighted distnce depend on the size of models? 3. Is it relly true tht the importnt chnge opertions re performed t the beginning? Tht is, does improvement of the verge weighted distnce monotoniclly decreses s the number of chnge opertions increses? 4. Would it be possible to further prune the dt such tht we cn improve the performnce of our heuristic mining lgorithm? In this section, we describe how we setup the simultion, simultion results themselves re presented in Section 7. In totl, we creted 72 groups of dtsets bsed on different scenrios. Ech of these groups consists of 1 reference model nd 100 vrints configured out of it. In totl, we crete 7272 process models for nlysis. This section is orgnized s follows. Section 6.1 describes n lgorithm to generte rndom reference process model. Section 6.2 describes bsed on which prmeters we cn configure collection of vrints out of such rndomly creted reference model. Section 6.3 summrizes the considered scenrios nd describe how we dpt the prmeters when creting the respective vrints. Finlly, Section 6.4 shows how simultions re setup to mesure the performnce of our heuristic vrint mining lgorithm. 6.1 Generting the Reference Process Model Our generl ide of rndomly generting (block structured) reference model is to cluster blocks, i.e., we rndomly cluster ctivities (blocks) into bigger block nd this clustering continues itertively until ll the ctivities (blocks) re clustered together. The detil of our pproch is depicted in lgorithm 3. To illustrte how Algorithm 3 works, n exmple is given in Fig. 13. As input set of ctivities {A,B,C,D,E} re given, nd the gol is to construct vlid, block-structured process model S out of them. The lgorithm strts by considering ech ctivity i s bsic block B i, nd dding these blocks to set B (lines 1 nd 2). Regrding our exmple, B = {{A}, {B}, {C}, {D}, {E}}. The lgorithm first rndomly select two blocks B i, B j (lines 4) nd cluster them together with rndomly chosen order reltion τ (lines 5 nd 6). Regrding our exmple, blocks {B} nd {C} re selected to construct new block {B, C} with rndomly chosen order reltion 1 (which mens B precedes C). The newly

28 28 input : Set of ctivities i the process model to be generted should contin, i = (i,..., n) output: Vlid process model S 1 Define ech ctivity i s bsic block B i, i = (1,..., n); 2 Define set B := {B 1,..., B n} /* initil stte */ ; 3 while B > 1 do 4 rndomly selected two blocks B i, B j B ; 5 rndomly select n order reltion τ {0, 1,, } ; 6 build block B k which contins sub-blocks B i nd B j hving order reltion τ ; 7 8 B := B \ {B i, B j } ; B := B {B k } ; 9 end 10 S := B 0 with B 0 B Algorithm 3: Rndomly generting reference model creted block {B, C} will then replce blocks {B} nd {C} in the block set B, i.e., B = {{A}, {B, C}, {D}, {E}} (lines 7 nd 8). This procedure (lines 4-8) is repeted until block set B contins one single block B 0 (B 0 = {A,B,C,D,E} regrding our exmple). This block then represents our rndomly generted process model S. (line 10). Fig. 13 shows the process model we rndomly generted s well s the block constructed in ech itertion. Order reltion rndomly chosen 0 1 E 1 * B C E D A Result A B C 1 2 D 3 4 Fig. 13. Exmple of generting rndom process model In prctise, certin order reltions re used more often thn others. For exmple, the predecessor-successor reltion is used more frequently thn AND/XORsplits [46]. When rndomly generting process model, we therefore tke this into ccount s well. Rther thn rndomly setting the order reltion for two blocks, we set the probbility for choosing AND-split (τ = ) nd XORsplit (τ = ) to 10% respectively, while predecessor-successor reltionships (τ = {0, 1}) re chosen with probbility of 80%.

29 Prmeters Considered for Generting Process Vrints Tking generted reference process model, we control how vrints re configured by djusting specific prmeters. For exmple, these prmeters determine how mny chnge opertions needed to perform to configure prticulr vrint nd where ctivities should be moved or inserted to nd so forth. Bsiclly, we hve considered the following prmeters when generting the process vrints. 1. Prmeter 1 (Size of Process Models) The size of vrints (i.e., the number of its ctivities) cn potentilly influence results. Therefore, we need to check the behvior of our lgorithm when pplying them to vrints of different sizes. This is lso importnt to test the sclbility of our lgorithm, i.e., whether or not the correltion of delt-fitness nd delt-distnce depends on th size of the models. 2. Prmeter 2 (Similrity of Process Vrints) This prmeter mesures how close these vrints re, e.g., whether or not the vrints re similr to ech other. In this context, similrity mesures how difficult it is to configure one vrint into nother [17]. 3. Prmeter 3 (Activity Occurrence) One importnt feture of our heuristic mining lgorithm is its bility to utomticlly decide which ctivities should or should not be considered in the discovered model. Clerly, ctivities with low ctivity frequency (cf. Def. 4) my not be considered. We use this prmeter to perform detiled nlysis of this sitution. 4. Prmeter 4 (Activity Consistence) As ctivity frequency cn be quickly computed by scnning the ctivity sets of the vrints, it is lso importnt to know whether the position for one prticulr ctivity is consistent. e.g., if n ctivity is often inserted, we re lso interested in whether such ctivity is lwys inserted into prticulr position. This prmeter therefore helps us to exmine whether or not the position where ctivities re inserted or moved to cn influence our serch lgorithm. As prmeter 1 nd 2 re esy to understnd, we use Fig. 14 to provide further nlyze of prmeter 3 nd 4. Consider the sitution when n ctivity j is very often inserted when configuring vrints (high occurrence), nd its inserted position is lso very consistent (high consistency). The heuristic lgorithm therefore should hve high chnce to lso insert j in the reference model, since such insertion is very often used in configuring ech vrint (up-right prt of Fig 14). On the contrry, if n ctivity ppers in only few vrints (low occurrence) nd its positions in those vrints re lso chnging ll the time (low consistency), we therefore cn ignore such ctivity (down-left prt of Fig. 14) since such chnge does not repet often. The difficult prt is the rest two prts in the vlue spce (mrked with question mrk). It would be difficult to know whether we should insert one ctivity with high ctivity frequency but its positions re not very stble. Reson is tht even if we insert this ctivity in the reference model, we lso need to move it frequently since its position in the vrints re not stble, therefore inserting such ctivity is not necessrily led to the reduce of the verge weighted distnce. On the

30 30 1 Occurrence Drop Keep 0 Consistency 1 Fig. 14. Consistency nd occurrence contrry, if n ctivity does not pper often in the vrints, but its positions in these vrints re very consistent, it is lso difficult to determine whether or not we should insert this ctivity in the reference model. If we insert it, we need to often delete it since it does not shows up often in the vrints. Therefore, in the following subsections, we will explin how to simulte these situtions to cover the vlue spce in our simultion. 6.3 Methods for Generting Dt Sets When configuring the vrints bsed on given reference model, we vry the vlues of the prmeters described in subsection 6.2 nd consequently generte vrints bsed on different scenrios. The following choices re vilble for the different prmeters: Prmeter 1 (Size of Process Models) This prmeter controls how mny ctivities shll be contined in the originl reference model nd consequently control, in generl, the size of process vrints. There cn be three options: 1. Smll-sized reference models: contining 10 ctivities 2. Medium-sized reference models: contining 20 ctivities 3. Lrge-sized reference models: contining 50 ctivities In our simultion, we use the sme reference model for the groups contining process models of sme size. Reson is tht we wnt to void the influence of the rndomly generted reference model. According to [21], process models contining more thn 50 ctivities hve high risk of errors. therefore, it is not recommended to design such lrge model. Following this guideline, we lso set the lrgest size of process model for 50 ctivities in our simultion. Note tht the vrints my hve different ctivity sets thn the reference model since we lso employ insert nd delete opertions when configuring vrint. Prmeter 2 (Similrity of Process Vrints) The closeness between the vrints is mesured by the totl number of chnge opertions we pply when generting vrints (cf. Def. 2). Three possible choices exist:

31 31 1. Smll-chnge: 10% of ctivities re chnged 2. Medium-chnge: 20% of ctivities re chnged 3. Lrge-chnge: 30% of ctivities re chnged For exmple, for the dtsets comprising lrge-size process vrints (i.e., vrints with 50 ctivities), medium-chnge would men we need to perform 10 chnge opertions on the reference model to configure prticulr process vrint. This wy, we cn control the distnce between the reference model nd its vrint. And indirectly, we cn control the similrity between vrints since they re ll controlled in certin distnces with the reference model. Prmeter 3 (Activity Occurrence) We use Fig. 15 to illustrte how we configure prmeter 3. The X-xis represents list of ctivities nd the Y-xis shows their corresponding probbility being involved in process configurtions. Given reference model S, let j N, j = 1..., m be the ctivities in S. Bsed on prmeter 1 nd 2, we cn decide on by performing how mny chnge opertions we re ble to configure S into prticulr vrint S i. For exmple, if prmeter 1 is lrge-sized model (S contins 50 ctivities) nd prmeter 2 is lrge-chnge (chnge 30% of ctivities in S), we know tht we need to chnge 15 ctivities to configure one vrint out of S. Let x be the number of chnge opertions, we cn crete x new ctivities x, x = n n + x. Then the m + x ctivities constitute the X-xis in Fig. 15. Probbility for chnge x = number of chnge opertions m-x+1... m-1 m m+1... m+x stble moved inserted Activity Old New Fig. 15. Chnge occurrence for ctivities in the reference model The Y-xis therefore shows the probbility ech ctivity is involved in chnge opertions. In order to compre ctivities with different probbility being involved in chnge opertions, we ssign ech ctivity with different probbility. We use Gussin distribution to relize the difference. For ctivity j, j = (m x + 1)... (m + x), the probbility it involved in chnge opertions is j j 1 (j m) 2 1 2π e 2( x 3 )2 (i.e.,x (m, ( x 3 )2 ), which mens the expected vlue of the dis-

32 32 tribution is m with stndrd devition x 3. The probbility of chnging j is the integrl in the intervl [j-1,j) ). Tble 3 shows the probbility for the groups with prmeter 1 being smll-sized nd prmeter 2 being lrge-chnge. Note tht ctivity K, L nd M, re not in the originl reference model S. The purpose of using Gussin distribution here is only to simulte the sitution tht ctivities hve different probbility being involved in chnges. We do NOT ssume tht the probbility for ctivities being involved in chnge must follow Gussin distribution or ny other distributions. Activity A B C D E F G H I J K* L* M* Probbility Tble 3. The probbility for ech ctivity being involved in chnge opertions when configuring process vrints After we described how to simulte the probbility for ech ctivity being involved in chnges, now we describe wht kind of chnge opertions we consider on generting ech vrints. Since the originl reference model S only contin ctivities j, j = 1... m, if n ctivity j, j = m m + x re involved in process configurtions, we cn only insert such ctivity in S in order to configure prticulr vrint S i. For the ctivities j, j = m x m which lredy exist in the reference model S, we cn move such ctivity to nother position when configuring prticulr vrint S i. We ignored delete opertion when generting the dtset since it cn be esily hndled. 9 While how often n ctivity j is inserted when generting vrints cn be esily obtined by ctivity frequency g( j ), it is difficult to know how often ctivities re involved in move opertions becuse move opertion does not led to the chnge of the ctivity set nd structure chnges re difficult to identify. We design our simultion by considering both insert nd move opertion lso with the purpose of checking whether these two types of chnge opertions re considered eqully importnt by our heuristic mining lgorithm. Since the move nd insert opertions hve the sme probbility to be employed when generting dtset, we would lso expect the heuristic mining lgorithm cn lso pply reltively sme mount of chnge or insert opertions to discover reference model. If it is not the cse, it is n indiction tht the fitness function is not properly designed. Prmeter 4 (Activity Consistence) While prmeter 3 describe how often prticulr ctivity is chnged, prmeter 4 controls where these ctivities re chnged (moved or inserted) to. As we discussed in subsection 5.2, there re numerous options to insert n ctivity j into prticulr process model S c. j cn be clustered with ny block 9 Delete opertion is only not considered in generting dtset but is still considered in performing mining. Deleting ctivity j leds to the chnge of ctivity frequency g( j), which cn lso be relized by insert opertion.

33 33 in S c by one of the order reltion τ = {0, 1,, }. Since the number of blocks contined in process model is often lrge, the number of possible resulting models by inserting j in S c is lso lrge. Therefore, for prticulr group, we cn define severl chnge opertions s consistent chnge opertions, i.e., when n ctivity j is inserted into S c, it is lwys clustered with prticulr block by prticulr order reltion τ. Therefore, we cn control how frequent we perform the corresponding consistent chnge opertions to configure prticulr vrint S i. The first step is then to determine these consistent chnge opertions for prticulr group. For exmple, consider Fig. 15. For given reference model S, we only llow performing x chnges from ctivity set j, j = (m x+1)... (m+x) to configure prticulr process vrints S i. The remining ctivities, i.e., j, j [1, (m x)], re stble, i.e., will not be chnged. These ctivities re the suitble cndidte blocks to be cluster with. For ech ctivity j, j [(m x + 1), (m + x)], we cn find corresponding ctivity j, j [1, (m x)], so tht if we need to perform consistent chnge on j, it is lwys clustered with j by prticulr order reltion τ. For exmple, consider the cse we described in Tble 3. Activities A, B, C, D, E, F nd G will not be involved in chnge since their probbility for chnge ll equl to 0. Therefore, if we need to move H, I or J or to insert K, L or M, we lwys cluster them with one of the stble ctivities, e.g., if we need to move J to configure prticulr vrint by consistent chnge opertion, it will lwys be clustered with B by order reltion sying τ = 0. We therefore cn control ctivity consistence of ctivity j by setting how often consistent chnge opertion is performed. If j is to be chnge, we set the probbility to perform consistent chnge opertion to p, nd the probbility to perform rndom chnge opertion (i.e., rndomly select block nd cluster it with rndom order reltion) is consequently 1 p. In order to nlyze the reltion between the ctivity consistency nd ctivity occurrence, we hve designed four different scenrios to cover the spce s illustrted in Fig. 14. The scenrios re depicted in Fig. 16. These four scenrios re constructed by keeping either occurrence or consistency stble while chnging the others: 1. LowOcc scenrio keeps occurrence of ctivities ll t 30%, while mkes the consistency of these ctivities vrying from 0 to 80%. 2. HighOcc scenrio keeps the occurrence of ctivities ll t 70% while mking the consistency of them lso vrying from 0 to 80%. 3. LowCon scenrio keeps the consistency of ctivities ll t 30% while mking the occurrence chnge from 0 to 80%. 4. HighCon scenrio keeps the consistency of ctivities ll t 70% while mking the occurrence chnges from 0 to 80%. Note tht the scenrio is not describing prticulr ctivities but set of ctivities. For the group with prmeter 1 being lrge-sized nd prmeter 2 being lrge-chnge, 30 ctivities (15 for move nd 15 for insert) will be involved in configuring the 100 process vrints with different occurrence or consistency vlues. Therefore, one prticulr scenrio cn cover lrge spce in the vlue

34 34 Occurrence LowCon HighCon 70% HighOcc 30% LowOcc 30% 70% Consistency Fig. 16. Spce coverge using simple scenrios spce (cf. Fig. 16). This cn significntly enhnce our reserch since for ech scenrios, we hve collection of ctivities for nlysis. In order to better cover the vlue spce, we hve designed four dditionl scenrios. While the occurrence lwys follows the curve s depicted in Fig. 15, the consistency of the corresponding ctivities re depicted in Fig Positively correlted scenrio mkes the consistency of prticulr ctivity positively correlted to its occurrence, i.e., when ctivity j hs high occurrence, it lso hs high consistency. This pplies to both moved nd inserted ctivities. 2. Negtively correlted scenrio, on the contrry, mkes the consistency negtively correlted to the occurrence. When j hs high occurrence vlue, it should hve low consistency. This pplies lso to both moved nd inserted ctivities. 3. Focus on move scenrio ssigns high consistency to the moved ctivities while set low consistency to the inserted ctivities. 4. Focus on insert scenrio on the contrry ssigns high consistency to the inserted ctivities nd low consistency to the moved ctivities. Fig. 17 lso depicts the vlue spce s covered by ech scenrio. We dditionl design these scenrio not only to better cover the vlue spces, but lso to nlyze whether the insert nd move opertions re considered eqully importnt by the lgorithm, i.e., since insert nd move opertions re eqully used when generting the dtset, our heuristic mining lgorithm lso should not show significnt difference on move nd insertions when discovering the reference model. To sum up, this subsection describes how ech group of dtset re generted by djusting the vlues of different prmeters. Since we hve 3 options for prmeter 1 (smll-size, medium-size nd lrge-sized), 3 options for prmeter

35 1 2 - x m m +x 1 2 m -x m -1 m m +x o ns i ste n cy o ns i ste n cy x m m +x 1 2 m -x m -1 m m +x o ns i ste n cy o ns i ste n cy 35 Consistncy Consistncy 80% stble 80% m m m moved inserted Old New Positively correlted Occurrence Occurrence C C Consistncy Consistncy 80% stble 80% Old m moved inserted New Negtively correlted Occurrence Occurrence C C Activity stble m moved m m inserted stble moved m inserted Old Focus on move New Old New Focus on insert Fig. 17. Spce cover using complex scenrios 2 (smll-chnge, medium-chnge nd lrge-chnge) nd 8 scenrios to cover the vlue spce of occurrence nd consistency, we hve in turn generted = 72 groups of dtset contining 1 reference model nd 100 vrints configured from it. In the next subsection, we will describe wht informtion we cn obtin from ech group of dtset. 6.4 Simultion Setup For ech one of the 72 groups of dtset constructed bsed on different scenrios, we perform our heuristic mining to discover new reference model by mining the collection of vrints. We do not set ny constrints on serch steps, i.e., the lgorithm will only terminte when no better model cn be discovered. We used Dell Ltitude Lptop (2.4 GHZ CPU nd 3.5 GB RAM) to run our simultion under Windows environment. The following informtion re documented in ech group: 1. Originl reference model, i.e., the model bsed on which we perform the chnges (cf. subsection 6.1).

36 process vrints. Bsed on given reference model, we generte ech vrint by configuring the reference model ccording to the different scenrios s described in subsection 6.3. For ech group, we generte 100 process vrints. Note tht lthough the 100 vrints re generted by following sme scenrio, these models re NOT sme. The reson is tht the scenrio only depicted the feture of the collection of vrints but not prticulr vrint. 3. Serch results. We document ll the intermedite process models s well s the end serch result s obtined from the heuristic serch. The corresponding chnge opertions re lso documented. As exmple consider Fig. 11 s the heuristic serch result we cn obtin by mining the reference model S nd vrints S i from Fig Fitness nd verge weighted distnce. Similr to the evlution results we presented in Tble 2 of serch results (cf Fig. 11), we lso compute the fitness nd verge weighted distnce vlue of every intermedite process models s obtined from our heuristic mining. We dditionl documents delt-fitness nd delt-distnce in order to exmine the influence of every chnge opertion. 5. Execution time. At lst, we lso document The running time to perform our heuristic serch lgorithm. 7 Simultion Result In Section 6, we hve described how the simultion is setup nd wht reserch questions we wnt o nswer through the simultion, in this section we give the nswer to these reserch questions. In prticulr: 1. Subsection 7.1 will provide the bsic nlysis of the heuristic lgorithm, e.g., verge distnce reduction, running time etc. 2. Subsection 7.2 will show the correltion nlysis of delt-fitness nd deltdistnce. 3. Subsection 7.3 will compre the correltion vlues of process models with different size to nswer whether the performnce of our lgorithm cn scle up. 4. Subsection 7.4 will exmine whether nor not the importnt chnges re performed t the beginning. 5. Subsection 7.5 will provide method to trin threshold vlue to improve the performnce of our heuristic mining lgorithm. 7.1 Bsic Performnce Anlysis Improvement on verge weighted distnces In 60 (our of 72) groups, we re ble to discover new reference model different thn the originl one. The verge weighted distnce of the discovered model is shorter thn tht of the originl reference model, i.e., setting the discovered model s the

37 37 new reference model cn reduce on verge chnge opertions to configure the vrints. When compred to the verge weighted distnce of the originl reference process model, it is on verge 17.92% shorter. Number of Chnge nd Move opertions For the 60 (out of 72) groups which we re ble to discover different reference model, we perform in totl 284 chnge opertions, i.e., on verge 4.73 chnge opertions per group. Within the 284 chnge opertions, there re 132 insert opertions nd 152 move opertions. Clerly, we see no significnt difference between the number of insert or move opertions. which indictes tht the two opertions re well-blnced by our lgorithm. Reson is tht we performed reltively sme mount of insert or move opertions when generting the dt set (cf. subsection 6.3) nd such trend is lso shown during the discovery of the reference model. Running Time Clerly, the number of ctivities contined in the vrints cn significntly influence the execution time of our lgorithm. The serch spce is simply lrger for lrge models since the number of cndidte ctivities for chnge is higher nd the number of blocks contined in lrge model is lso higher. We therefore nlyze the execution time of our lgorithm by considering the size of process models. The verge running time is summrized in Tble 4. smll-sized medium-sized lrge-sized Averge serch time (s) Averge # of chnges performed Model sizes Tble 4. Averge serch time for process models of different sizes While it tkes only little time to discover the result model for the smllsized nd medium-sized models, it tkes considerble longer time (on verge seconds) to find the result for the lrge-sized model. It tkes long not only becuse the process models re lrger, but lso becuse the serch steps re longer s well. We need to perform on verge 8.43 chnge opertions on the originl reference model to discover the end result. Reson is tht the vrints in these groups re more different from ech other thn those in the smll-sized nd medium-sized model (the number of chnge opertions is determined by the model size, i.e., we chnge respectively 10%, 20% nd 30% of ctivities to configure vrint (cf. subsection 6.3)). The consequence is tht the discovered model could lso be more different from the originl model. However, we believe the running time is cceptble considering the complexity of the problem, especilly when compred with other dt mining problems which might tke hours or even dys to compute [36].

38 Correltion of Delt Fitness nd Delt Distnce One importnt issue we wnt to investigte is how fitness vlue is correlted with verge weighted distnce. As we discussed in Section 6, fitness vlue is only quick guess of how close cndidte model S c is to the collection of vrints, it is not s precise s the verge weighted distnce nd cn lso not perfectly correlted with the verge weighted distnce since computing it is n N P problem. In this subsection, we will nlyze how much they re correlted with ech other. Our heuristic serch lgorithm is best-first pproch, i.e., we serch whether we cn find process model with higher fitness vlue, therefore, it is more useful to mesure how much the delt-fitness (the difference between the fitness vlues before nd fter chnge) is correlted with the delt-distnce (the difference between the verge weighted distnces before nd fter chnge), becuse it is improvement of the fitness vlue (delt-fitness) tht guide the serch steps (cf. Section 5). Another reson not to directly nlyze the correltion of the fitness vlue nd the distnce vlue is their vlue rnges. While, the fitness vlue of model hs vlue rnge [0, 1], the verge weighted distnce hs vlue rnge [0, + ]. On the contrry, the delt-fitness nd the delt-distnce both hve vlue rnge [ 1, 1] since we only chnge one ctivity t time. Therefore, the correltion between the delt-fitness nd delt-distnce is more resonble to consider. Similr techniques for evluting fitness function is lso widely used in evluting other lgorithms [14]. Since every chnge opertion will led to prticulr chnge on the process model nd consequently cretes delt-fitness vlue nd delt-distnce vlue, we obtin 284 dt smples since we performed in totl 284 chnge opertions. We plot these dt smple in Fig. 18 s (delt-fitness, delt-distnce). For exmple, consider the serch result in Tble 2. We cn obtin three dt smples of delt-fitness nd delt-distnce from it, i.e., (0.171,0.8), (0.04,0.6) nd (0.017,0.25). In Fig. 18, the x-xis is the delt-fitness vlue nd the y-xis is the deltdistnce vlue. All delt-fitness re lrger thn 0. It is esy to understnd since the lgorithm would only perform chnge if nd only if it cn find model with higher fitness vlue. However, the delt-distnce is not lwys lrger thn 0, which indictes tht sometimes the chnge opertion cn mke the result even worse. It is not surprising to see this since the fitness vlue is only quick guessing of the distnce. We dditionl plot line with delt-distnce being 0 to seprte good smples (positive delt-distnce) nd bd smples (negtive delt-distnce). In Fig. 18, we lso mrked the dt smple from groups of different model sizes seprtely. It is cler tht these three groups form three different clusters, i.e., they re not overlpping too much with ech other nd the lrger the model size is, the more its cloud positions towrds the y-xis. This indictes tht the size of process model cn influence the delt-fitness vlue. Reson is tht the fitness vlue is mesured by the verge of how much ech pir of order reltions mtches ech other in the cndidte model nd in the vrints (cf. Formul 5),

39 Good Bd Fig. 18. Construct the serch tree therefore since one chnge opertion only chnge one ctivity, it would hve higher influence on the smll model compred to on the big ones.

39 39 Good Bd Fig. 18. Construct the serch tree therefore since one chnge opertion only chnge one ctivity, it would hve higher influence on the smll model compred to on the big ones. Therefore, it is more resonble to nlyze the correltion of the groups of different sizes seprtely. We use Person correltion to mesure the correltion between the deltfitness nd delt-distnce [35]. Let X be the delt-fitness nd Y be the deltdistnce. We obtin n dt smples (x i, y i ) where i = 1,..., n. Let x nd ȳ be the men of X nd Y, let s x nd s y be the stndrd devition of X nd Y. The Person correltion cn be computed using Formul 6. r xy = xi y i n xȳ (n 1)s x s y (6) We dditionl tested whether the correltions re significnt, i.e., whether the correltion coefficients re significntly different from 0 [35]. The results re summrized in Tble 5. Number of dt Correltion Probbility of r = 0 Significntly Correlted Smll-sized <1.0E-8 Yes Medium-sized <1.0E-8 Yes Lrge-sized <1.0E-8 Yes Tble 5. Delt fitness nd delt distnce correltions of groups with different sizes

2 Computing all Intersections of a Set of Segments Line Segment Intersection

2 Computing all Intersections of a Set of Segments Line Segment Intersection 15-451/651: Design & Anlysis of Algorithms Novemer 14, 2016 Lecture #21 Sweep-Line nd Segment Intersection lst chnged: Novemer 8, 2017 1 Preliminries The sweep-line prdigm is very powerful lgorithmic design