Identification of Data Cohesive Subsystems Using Data Mining Techniques


Carlos Montes de Oca and Doris L. Carver
Department of Computer Science, Louisiana State University
Baton Rouge, Louisiana, USA

Abstract

The activity of reengineering and maintaining large legacy systems involves the use of design recovery techniques to produce abstractions that facilitate the understanding of the system. In this paper, we present an approach to design recovery based on data mining. This approach derives from the observation that data mining can discover unsuspected non-trivial relationships among elements in large databases. This observation suggests that data mining can be used to elicit new knowledge about the design of a subject system and that it can be applied to large legacy systems. We describe the ISA methodology, which uses data mining to identify data cohesive subsystems. We were able to decompose COBOL systems into subsystems by using this approach. Our experience shows that data mining can identify data cohesive subsystems without any previous knowledge of the subject system. Furthermore, data mining can produce meaningful results regardless of system size, making this approach especially appropriate to the analysis of large undocumented systems.

1. Introduction

There are software systems that for many years have supported the activities of large organizations, yet the software systems have grown old. These systems, known as legacy systems, show signs of deterioration. Generally speaking, the term legacy systems refers to old software systems for which maintenance has become very expensive; however, they perform a critical task for the enterprise. The problem grows daily because legacy systems need continual maintenance to respond to the demands of the constantly-changing business environment. This situation promotes fast and unplanned modifications that inevitably make the problem greater.
Consequently, the maintenance of legacy systems continues to grow more complex and expensive.

One approach to cope with the legacy systems problem is reengineering. Reengineering addresses the problem in two stages. The first stage, called reverse engineering, focuses on understanding the legacy system. The second stage, called forward engineering, uses the information produced in the reverse engineering stage and adds new specifications to rebuild the legacy system using modern technologies.

A relevant subarea of reverse engineering is design recovery, which focuses on producing meaningful high level abstractions from the subject system [1]. To this end, design recovery may use any available source of information such as source code, documentation, domain knowledge, and personal experience. Similarly, the identified abstractions may take different forms such as module breakdown, structure charts, entity-relationship diagrams, and formal specifications. Design recovery also plays an important role in the areas of maintenance and reuse by providing information that simplifies the understanding of the system and the localization of reusable parts.

As suggested above, there are many approaches to design recovery. They differ in many respects such as the sources of information they use, the extraction procedures and tools they employ, and the outcome they produce. In this paper, we present an approach to design recovery that is based on data mining. Data mining is the process of extracting patterns or models from data by applying specific algorithms [17]. We are interested in data mining techniques because they have features that are potentially helpful for reverse engineering and maintenance tasks. For instance, the data mining process finds nontrivial and previously unknown patterns in large databases. That is, data mining techniques are capable of revealing unknown classifications, associations, and sequences among data. Such information may be used to identify objects, to detect reusable components, or to find novel ways to form relationships among a system's components. This feature is especially helpful when dealing with large undocumented systems.

In general, our research aims to use data mining techniques as a tool in the reverse engineering and maintenance domains. In particular, we explore the use of data mining techniques to recover the design of imperative legacy code. We describe a general method to apply data mining to design recovery tasks. The method consists of three steps: (1) design a database view of the subject system that is used as the input to the mining process, (2) select a data mining algorithm to mine the required knowledge, and (3) design algorithms that consolidate the results of the data mining process into a meaningful high level abstraction. To test the feasibility of this method we have developed the ISA methodology. ISA identifies data cohesive subsystems based on mined association rules. We have been able to apply ISA to decompose COBOL systems into a set of subsystems.

Using data mining for design recovery offers two advantages. First, data mining is capable of producing a logical decomposition of a system without using supporting information such as documentation or a priori knowledge about the functionality of the system. Second, data mining is designed to deal with a large amount of information. Therefore, data mining can perform well regardless of the size of the subject system. These advantages are important, given the lack of documentation and the size of most legacy systems.

The rest of the paper is organized as follows. Section 2 includes the related work. Section 3 explains the underlying ideas behind data mining and the motivation for pursuing this approach. Section 4 describes the ISA methodology, followed by a case study in section 5. Section 6 contains the results of the case study, and section 7 presents the conclusions.

2. Related work

Research in design recovery uses different approaches and produces diverse results. For example, although the preferred source of information to recover the design is the source code, some research uses non-code sources such as data flow diagrams [2] and structured specifications [3].

Works using the source code as a primary source of information utilize different approaches. Formal approaches use rigorous mathematical procedures and notations to extract a formal description of the subject system. The advantage of such a description is that it is precise, verifiable, and amenable to automation. Works in this category are described in [4] and [5]. Knowledge-based works add knowledge representation of some type to the design recovery process [6]. Other research addresses legacy systems by identifying objects. That is, the goal is to extract an object-oriented representation of the subject system [7], [8]. Moreover, there has been an increased interest in object recovery, or the identification of objects within procedural programs [9], [10]. Other related works focus on production of a hierarchical design description [11], identification of clichés [12], and generation of an SSADM (Structured Systems Analysis and Design Method) representation [13].

The idea of using a database representation of the subject system is not new. For example, Chen [14] generates a relational view of C code to support software activities such as graphical views, subsystem extraction, binding analysis, dead code elimination, and program layering. Narat [15] uses a database to support maintenance activities of source code by producing cross-reference documentation. Grass [16] uses CIA++ (C++ Information Abstractor) to extract design information from C++ programs. CIA++ constructs a relational database that contains information obtained from C++ programs. Her aim is to do object recovery by querying the relational database created by CIA++.
Although these works use a database representation of the subject system, data mining techniques are not used to extract design information.

3. Data mining and design recovery

The underlying idea behind data mining is the extraction of useful information from large databases. For years, organizations have been collecting data as part of their normal operations. Consequently, they have created large databases that contain meaningful information for the organization such as classifications, trends, and patterns. This information is a powerful base for making decisions, forecasts, and planning. Unfortunately, traditional querying systems are not capable of extracting this "hidden" information; thus, data mining technology emerged to extract such information.

The overall process of discovering useful knowledge from large volumes of data is known as knowledge discovery in databases [17]. This process encompasses several steps that range from understanding the problem domain and data preparation to interpreting the mined patterns and consolidation of the discovered knowledge. Data mining is one of these steps, which consists of applying appropriate data mining algorithms to extract the desired patterns from the databases. In this context, patterns refer to expressions that represent facts about the data contained in the database [18]. An example of these patterns is "s% of the customers that purchase item A in a visit purchase items B and C in the following visit."

The objectives of data mining algorithms include:

Classifications. The objective is to classify a data item into one of several predefined classes. For example, data mining classification algorithms can be used to identify particular objects in huge image databases.

Clustering. The idea is to group data items to form classes or clusters of data items according to some similarity function. In this case, the data mining algorithm defines the classes as opposed to classification where the classes are predefined. For instance, data mining clustering algorithms can be used to identify groups of homogeneous people to help develop a marketing plan.

Associations. The objective is to find association rules of the form "c% of the customers that buy product A also buy products B and D." This kind of information can be used to design the floor plan of the store, the marketing strategy, or even to forecast inventory levels.

Sequences. The idea is to find sequences of events. For example, if event A occurs, then c% of the time events B and D occur within the next t units of time. This information can be used to forecast equipment failures and stock booms.

Consequently, there are different data mining algorithms. Each algorithm aims to mine a specific pattern (knowledge).

We are particularly interested in data mining due to the following observations. First, data mining can discover unsuspected non-trivial relationships among data elements in large databases. Second, data mining techniques are capable of mining relevant information regardless of the previous knowledge of the object of study. Finally, data mining is designed to work with a large amount of information. These features of data mining suggest that this technique has potential as a valuable tool for reengineering and maintenance tasks.
For instance, the first observation suggests that if a software system were represented as a database, it might be possible to apply data mining to unveil relevant relationships among the system's components. In other words, data mining could elicit new knowledge about a subject system. This information could be used to support diverse reverse engineering and maintenance tasks such as design recovery, object extraction, identification of reusable parts, and detection of repeated code. The second observation suggests that data mining techniques have the potential to produce novel ways to relate a system's components even without previous knowledge of the system's functionality and implementation details. Finally, the last observation implies that data mining can analyze large software systems at no detriment to performance. Indeed, the larger the system, the better the chances that data mining will produce significant information. Therefore, this approach seems well suited for reengineering and maintaining legacy systems.

4. The ISA methodology

This section describes our approach to design recovery. We first describe a general method to apply data mining to design recovery. Then, we present an instance of the method, called the ISA methodology (Identification of Subsystems based on Associations). The general method consists of the following three steps:

1. Define a database view of the system. A database view of the system is a representation of the system or a subset of it using a database. The data to be loaded into the database comes from any source of information, primarily from the source code (e.g., variables, programs, modules, files). The selection of the database view determines the type of information that the data mining algorithms can mine. Consequently, the selection of this view is critical to the success of the mining analysis, and it is done with the selection of the particular data mining algorithm in mind.
2. Perform data mining.
This step involves the selection and use of data mining algorithms to mine the database view of the system. The selection of data mining algorithms depends on the specific information requirements of the design recovery process, such as associations, sequences, classifications, or clusters.
3. Consolidate and interpret results. The outcome of the mining process is combined into meaningful knowledge to construct the design of the system.

This method can be applied at different granularity levels. For instance, it has potential to be used to analyze a program, a module, or a system. In addition, it can produce diverse high level abstractions depending on the specific instantiation of the database view, the mining algorithms, and the consolidation procedure.

In order to apply the method, we developed the ISA methodology. ISA is a system level methodology whose objective is to produce a decomposition of a system into data cohesive subsystems. In this context, a subsystem is a subset of programs and files. A data cohesive subsystem is formed by programs that use the same persistent data repositories (i.e., data files).

ISA instantiates the three-step method as follows. In the first step, ISA represents the database view of the system as a set of tuples. In the second step, ISA mines association rules. In the last step, ISA builds a table based on the associations found. Then, the table is used to identify subsystems.

Before describing each step, we introduce some definitions. Let $\mathcal{S}$ be a software system composed of a set of programs $\mathcal{P}$ and a set of data files $\mathcal{F}$. For example, a payroll system written in COBOL may consist of several programs and several files. The system would likely include programs to print the payroll, to print checks, to perform the calculations, and to report tax withheld. In addition, the system would contain the employee file, the salaries file, and the scheduling file. For simplicity, this definition of a system does not include script files and JCL (Job Control Language) scripts.

A program $p$ uses a file $f$ if $p$ reads or writes information on $f$. Let $U(p, \mathcal{F})$ denote the number of files $f \in \mathcal{F}$ that the program $p$ uses. For example, if $\mathcal{F} = \{A, B, D, H, G\}$, $\mathcal{P} = \{m, t, w, x\}$, and program $x$ uses files $A$, $D$, and $G$, then $U(x, \mathcal{F}) = 3$. Similarly, let $Q(f, \mathcal{P})$ denote the number of programs $p \in \mathcal{P}$ that use file $f$. That is, if programs $t$ and $w$ use file $B$, then $Q(B, \mathcal{P}) = 2$.

We describe each of the three steps of the ISA methodology in sections 4.1, 4.2, and 4.3.

4.1. Define a database view of the system

The ISA methodology performs data preprocessing before defining the database view of the system. The objective is to generate a clean data set for the mining process. The resulting data set is known as the alpha set. The alpha set is the set $A = \{P, F\}$ such that $P \subseteq \mathcal{P}$, $F \subseteq \mathcal{F}$, $P = \{p \mid U(p, F) > \gamma\}$, and $F = \{f \mid Q(f, P) > \beta\}$, where $\gamma$ and $\beta$ are integers with $\gamma > 0$ and $\beta > 0$. The alpha set contains programs that use more than $\gamma$ files and files that are used by more than $\beta$ programs. These parameters are necessary to avoid introducing noise to the analysis. For example, if $\gamma = 0$ were allowed, the alpha set would include programs that use just one file. These programs do not provide information to form associations among files.

The alpha set is not produced just by removing from $\mathcal{S}$ the files and programs that do not satisfy the $\beta$ and $\gamma$ constraints.
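As an illustration only, the alpha-set definition above can be read as a fixed-point computation: files and programs are filtered repeatedly until a pass removes nothing. The Python sketch below assumes the "uses" relation is available as a dictionary from program names to sets of file names; the function name `alpha_set`, the sample data, and the extra programs `y` and `z` are hypothetical and not taken from the paper.

```python
def alpha_set(uses, gamma=1, beta=1):
    """Return (P, F): programs using more than `gamma` kept files and
    files used by more than `beta` kept programs.

    uses: dict mapping each program to the set of files it reads or writes.
    """
    programs = set(uses)
    files = set().union(*uses.values()) if uses else set()
    while True:
        # Q(f, P) > beta: keep files used by enough of the remaining programs.
        kept_files = {f for f in files
                      if sum(1 for p in programs if f in uses[p]) > beta}
        # U(p, F') > gamma: keep programs that use enough of the kept files.
        kept_programs = {p for p in programs
                         if len(uses[p] & kept_files) > gamma}
        if kept_files == files and kept_programs == programs:
            return programs, files
        programs, files = kept_programs, kept_files


# Hypothetical data: dropping y makes H fail the beta test, which in turn
# removes z and then G, so more than one pass is needed.
uses = {
    "m": {"A", "D"},
    "t": {"A", "B", "D"},
    "w": {"B", "D"},
    "x": {"A", "D", "G"},
    "y": {"H"},
    "z": {"G", "H"},
}
P, F = alpha_set(uses, gamma=1, beta=1)
```

With this data a single filtering pass would keep files G and H and program z, which is why the sketch loops until the sets stop changing.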
It is possible that by removing a program or file from $\mathcal{S}$ another program or file would no longer satisfy the $\beta$ and $\gamma$ constraints. Thus, the generation of the alpha set is an iterative process. The algorithm in Figure 1 produces the alpha set.

    (1)  done = false, $F = \mathcal{F}$, $P = \mathcal{P}$, $F' = \emptyset$, $P' = \emptyset$
    (2)  do
    (3)      $F' = \{f \in F \mid Q(f, P) > \beta\}$
    (4)      $P' = \{p \in P \mid U(p, F') > \gamma\}$
    (5)      if ($F' = F$) and ($P' = P$) then
    (6)          done = true
    (7)      else
    (8)          $F = F'$
    (9)          $P = P'$
    (10) until (done)
    (11) $A = \{P, F\}$

Figure 1. Algorithm to obtain the alpha set

Having the alpha set, the database view of the system is defined as the set of tuples $T = \{t_1, t_2, \ldots, t_{|F|}\}$ where $t_i = \{p \in P \mid p \text{ uses } f_i\}$. There is a tuple $t$ for each file $f \in F$. A tuple $t_i$ contains all the programs that use file $f_i$.

4.2. Perform data mining

The data mining task used in this step is the search for association rules on the alpha set. Agrawal, Imielinski, and Swami [19] introduced the problem of mining association rules from large databases of transactions. Their original idea is to find associations among the items a customer buys. For instance, having a large database of transactions, each transaction containing all the products purchased by a customer in a particular visit, the goal is to produce rules of the form "90% of the times a customer buys milk, she also buys bananas."

Formally, the problem of mining association rules is defined as follows [19]. Let $I = \{i_1, i_2, i_3, \ldots, i_m\}$ be a set of items. $D$ is a set of transactions $R$ such that $R \subseteq I$. Additionally, let us say that $R$ contains $X$ if $X \subseteq R$. An association rule is an implication $X \Rightarrow Y$, where $X \subset I$, $Y \subset I$, and $X \cap Y = \emptyset$. The rule $X \Rightarrow Y$ holds in $D$ with confidence $c$ if $c$% of transactions in $D$ that contain $X$ also contain $Y$. In addition, the rule $X \Rightarrow Y$ has support $s$ if $s$% of the transactions in $D$ contain $X \cup Y$. Then, the problem of finding association rules in a set of transactions consists of finding all the association rules having $s >$ minsup and $c >$ minconf.
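To make the definitions above concrete, the following Python sketch builds the database view (one tuple of programs per file) and computes the support and confidence of a single rule. It is a minimal sketch under stated assumptions: the helper names `database_view`, `support_count`, and `rule_stats`, as well as the sample data, are hypothetical and chosen only for illustration.

```python
def database_view(uses, programs, files):
    """One tuple per file: the set of programs that use that file."""
    return {f: frozenset(p for p in programs if f in uses[p]) for f in files}


def support_count(transactions, itemset):
    """Number of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def rule_stats(transactions, x, y):
    """Return (support %, confidence %) of the rule X => Y."""
    both = support_count(transactions, set(x) | set(y))
    only_x = support_count(transactions, x)
    support = 100.0 * both / len(transactions)
    confidence = 100.0 * both / only_x if only_x else 0.0
    return support, confidence


# Hypothetical alpha set: four programs and three files.
uses = {
    "m": {"A", "D"},
    "t": {"A", "B", "D"},
    "w": {"B", "D"},
    "x": {"A", "D"},
}
view = database_view(uses, programs=uses.keys(), files=["A", "B", "D"])
transactions = list(view.values())

# Rule {m} => {x}: files A and D are used by both m and x, and m uses
# no other file, so the confidence is 100%.
support, confidence = rule_stats(transactions, {"m"}, {"x"})
```

Here the programs play the role of items and each file contributes one transaction, which matches the mapping of the ISA notation onto the association-rule problem.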
Minsup and minconf are user-supplied parameters representing the minimum required support and confidence, respectively.

In the ISA notation, $P$ represents the set of items $I$, $T$ is the set of transactions $D$, and a tuple $t$ is a transaction $R$. Mining the alpha set for association rules produces associations of the form $s[p_1, p_2, \ldots, p_n]$, where $s$ is the support of the association, $n \le |P|$, and $p_i \in P$. In other words, $s$ is the number of tuples in $T$ that contain $p_1, p_2, \ldots, p_n$. The interpretation of such an association rule is that programs $p_1, p_2, \ldots, p_n$ use the same $s$ files. The association does not mean that all the programs in the association use just the $s$ files; rather, it means that those $s$ files are common among the programs in the association. Thus, an association provides the rationale to form groups of programs based on the common data repositories the programs use.

4.3. Consolidate and interpret results

Once the associations are mined, they are used to form groups of programs and files (i.e., subsystems). To this end, a grouping table is built. This grouping table organizes programs and files in rows and columns, respectively. For each program in the alpha set there is a row in the grouping table. Similarly, for each file in the alpha set there is a column in the grouping table. The intersection of a row with a column is marked if the program represented by the row uses the file represented by the column.

The construction of the grouping table is a bottom-up iterative process. In each iteration, new rows are added to the grouping table. Then, the rows are reorganized to keep the programs that use similar files in adjacent rows. Likewise, the columns are rearranged to keep the files that are used together in adjacent columns. This process continues until all the programs are represented in the grouping table. The algorithm in Figure 2 builds the grouping table.

Some heuristics are required in lines 3, 7, and 14 of the algorithm. For example, let $a_i = \{p_1, p_2\}$ where $a_i \in \mathcal{A}$. Assume that the algorithm is in iteration $i$, that $p_1$ and $p_3$ are already in the grouping table, and that $p_1$ is in row 10 and $p_3$ is in row 11. Assume also that $s_i = 14$ and that $M(p_1, p_3) = 16$. Line 3 of the algorithm requires putting all $p \in a_i$ in adjacent rows. That is, $p_2$ has to go adjacent to $p_1$. However, $M(p_1, p_3) > M(p_1, p_2)$, since $s_i$ counts the files common to $p_1$ and $p_2$. In this case, we add a new row before row 10, put $p_2$ in this new row, and then renumber the rows.

Once the table has been built, it contains a grouping of programs and files. The programs that use a similar set of files are in adjacent rows and the files that are likely to be used together are in adjacent columns, thereby identifying groups of programs that use the same set of files. In this sense, ISA identifies data cohesive subsystems.

The final grouping might contain files that are used by more than one group of programs (subsystem). This situation is normal because the subsystems are interrelated.
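The grouping-table construction described above can be sketched in Python as follows. This is a simplified illustration, not the procedure used in the paper: the placement heuristics are replaced by one assumed rule (a new program is inserted right after the last member of its association that is already in the table), and the names `common_files` and `grouping_table`, together with the sample data, are hypothetical.

```python
def common_files(uses, p, q):
    """M(p, q): number of files that programs p and q both use."""
    return len(uses[p] & uses[q])


def grouping_table(uses, files, associations):
    """Order rows (programs) and columns (files) from mined associations.

    uses:         dict program -> set of files (alpha-set programs only)
    files:        iterable of alpha-set files
    associations: list of (support, programs) pairs
    """
    rows, cols = [], []
    # Process associations from largest to smallest support.
    for _, progs in sorted(associations, key=lambda a: -a[0]):
        placed = [p for p in progs if p in rows]
        # Assumed stand-in for the placement heuristics: new programs go
        # right after the last member of the association already placed.
        at = max(rows.index(p) for p in placed) + 1 if placed else len(rows)
        for p in sorted(progs):
            if p not in rows:
                rows.insert(at, p)
                at += 1
        # Files common to all programs in the association get adjacent columns.
        shared = set.intersection(*(uses[p] for p in progs)) & set(files)
        cols += [f for f in sorted(shared) if f not in cols]
        if len(rows) == len(uses):
            break
    # Programs in no association go next to the program sharing most files.
    for p in sorted(set(uses) - set(rows)):
        if not rows:
            rows.append(p)
            continue
        q = max(rows, key=lambda r: common_files(uses, p, r))
        rows.insert(rows.index(q) + 1, p)
    cols += [f for f in sorted(files) if f not in cols]
    marks = [[1 if f in uses[p] else 0 for f in cols] for p in rows]
    return rows, cols, marks


# Hypothetical alpha set and a single mined association with support 2.
uses = {
    "m": {"A", "D"},
    "t": {"A", "B", "D"},
    "w": {"B", "D"},
    "x": {"A", "D"},
}
rows, cols, marks = grouping_table(uses, ["A", "B", "D"], [(2, ("m", "t", "x"))])
```

In this example program `w` belongs to no association, so it is placed next to `t`, the program with which it shares the most files. A faithful implementation would also compare $M$ values when choosing on which side of an existing row to insert, as in the row 10 example above.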
Files shared by several groups can be seen as the interfaces among subsystems. Consequently, the ISA methodology decomposes a software system $\mathcal{S}$ into $k$ subsystems $Z_i = \{G_i, H_i\}$ where $G_i \subseteq \mathcal{P}$ and $H_i \subseteq \mathcal{F}$, for $i = 1, 2, \ldots, k$, and $G_i \cap G_j = \emptyset$, $H_i \cap H_j = \emptyset$ for $i, j = 1, 2, \ldots, k$, and $i \ne j$. This decomposition does not include all the programs and files in $\mathcal{S}$ because some programs cannot be classified into any subsystem and some files are used by several subsystems. Therefore, $G_1 \cup G_2 \cup \cdots \cup G_k$ may not be equal to $\mathcal{P}$, and $H_1 \cup H_2 \cup \cdots \cup H_k$ may not be equal to $\mathcal{F}$. Although this subsystem decomposition produces disjoint sets of programs and files, it does not imply that a program in subsystem $Z_i$ cannot use a file in subsystem $Z_j$.

    Let $M(p_x, p_y)$ denote the number of common files between $p_x$ and $p_y$
    (1)  Let $\mathcal{A} = \{a_1, a_2, a_3, \ldots, a_k\}$ be the set of associations sorted by $s$ (i.e., $s_1 \ge s_2 \ge \cdots \ge s_k$)
    (2)  for i = 1 to k
    (3)      Put in adjacent rows the programs $p \in a_i$
    (4)      for each $p \in a_i$
    (5)          mark the columns that represent a file used in $p$
    (6)      endfor
    (7)      Put in adjacent columns the files that are common to all the programs in $a_i$
    (8)      if all the programs have been included in the table then
    (9)          STOP
    (10)     endif
    (11) endfor
    (12) for each $p$ not included in the grouping table
    (13)     Find a program $q$ in the grouping table such that $M(p, q)$ is maximum
    (14)     Put $p$ in an adjacent row to program $q$
    (15) endfor

Figure 2. Algorithm to build the grouping table

5. Case study

We have applied the methodology to COBOL systems. For simplicity, we discuss the details of applying ISA to a small system, known as the TRS system, consisting of approximately 25 KLOC distributed in 28 COBOL programs. TRS also uses 36 data files.

We started by creating the alpha set. First, a number was assigned to each file and to each program (e.g., programs p1 to p28, and files f1 to f36). Then, we used the algorithm defined in section 4.1 to create the alpha set. We ran the algorithm with $\gamma = 1$ and $\beta = 1$. The resulting alpha set consisted of 22 programs and 24 files.

Next, we created the database view in an ASCII file. Each line in the ASCII file contained a tuple as defined in section 4.1. A tuple consisted of the file number followed by the numbers of the programs that use that file. For example, the tuple [29 16 18 19 20] means that file 29 is used by programs 16, 18, 19, and 20. There was a record in the ASCII file for each file in the alpha set. Note that the program numbers within a tuple were sorted due to a requirement of the data mining algorithm that we used.

We applied the Apriori [20] algorithm to mine association rules. Apriori mines association rules in two steps. First, Apriori finds all the item sets with transaction support greater than the threshold value minsup. That is, it finds all the sets of items contained in more than minsup transactions. These item sets are called large itemsets. Second, Apriori generates the association rules based on the large itemsets found in the previous step.

Apriori does several passes over the data set. In pass $k$, Apriori finds large itemsets of size $k$, called $k$-itemsets. Specifically, for each iteration $k$, Apriori generates candidate sets using the $(k-1)$-itemsets and then traverses the tuples to calculate the support for each candidate set. The candidate sets with support > minsup form the $k$-itemsets. The process continues until Apriori cannot generate more candidate sets. For a detailed description of the Apriori algorithm refer to [20].

Our version of Apriori does not use the minsup parameter as a percentage but as a number. Moreover, our version of Apriori just finds large itemsets. We ran Apriori using minsup = 3 to mine for associations with support equal to or greater than 3. We obtained 538 associations. Then, we sorted the associations by support. The top 10 associations are shown in Table 1.
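The level-wise search described above can be sketched in Python as follows. This is a minimal illustration rather than the version of Apriori used in the case study: it treats minsup as an absolute count, keeps itemsets whose support is equal to or greater than minsup, and only finds large itemsets. The function name `large_itemsets` and the five sample tuples are hypothetical.

```python
from itertools import combinations


def large_itemsets(transactions, minsup):
    """Return {itemset: support count} for itemsets with count >= minsup."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidate):
        return sum(1 for t in transactions if candidate <= t)

    # Pass 1: large 1-itemsets.
    level = {}
    for item in {i for t in transactions for i in t}:
        c = frozenset([item])
        n = count(c)
        if n >= minsup:
            level[c] = n

    result = {}
    k = 1
    while level:
        result.update(level)
        k += 1
        prev = set(level)
        # Join step: combine large (k-1)-itemsets into candidates of size k.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        level = {}
        for c in candidates:
            # Prune step: every (k-1)-subset of a candidate must be large.
            if all(frozenset(s) in prev for s in combinations(c, k - 1)):
                n = count(c)
                if n >= minsup:
                    level[c] = n
    return result


# Hypothetical tuples: each set lists the programs that use one file.
tuples = [{16, 18, 19, 20}, {16, 18, 19}, {16, 18}, {16, 18, 19}, {20, 21}]
found = large_itemsets(tuples, minsup=3)

# Associations in the s[p1 p2 ...] sense: itemsets with at least two programs,
# sorted by decreasing support.
associations = sorted(
    ((n, sorted(c)) for c, n in found.items() if len(c) > 1), reverse=True
)
```

For these sample tuples the association with the largest support pairs programs 16 and 18, which share four files.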
The first association 16[16 18] means that programs 16 and 18 use the same 16 files. If we consider that program 16 uses 17 files and program 18 uses 18 files, the first association implies that these programs share more than 85% of the files they use. The third association 14[16 18 19] means that programs 16, 18, and 19 use the same 14 files (program 19 uses 16 files). This information could indicate that these programs perform similar or complementary functions. Thus, this set of programs and their 14 common files can be grouped in one subsystem.

Table 1. Associations with largest support (columns: No, s, Programs)

Next, we built the grouping table. We started with the first association in Table 1. We created the first two rows of the table. Row one represented program 16 and row two program 18. Each of the 24 columns represented one file. We put a 1 in the columns that represent files used by programs 16 and 18 (Figure 3). Then, we considered the second association and drew a third row representing program 19. The third association was trivial since we only had three rows. However, we used this information to arrange the columns in such a way that the 14 common files were in adjacent columns. Similarly, the fourth association was used to arrange columns. The fifth association produced rows 4 and 5 (Figure 4). We continued in this way until we had used all associations. Finally, we appended to the table the programs that were not included in the associations but were in the alpha set. The complete table is shown in Figure 5.

Figure 3. First two rows of the grouping table

Figure 4. Grouping table with five rows

6. Results

The grouping table in Figure 5 was used to identify subsystems. The table contains 5 groups of programs. The first group contains programs 17, 16, 18, and 19. Programs 16, 18, and 19 use 11 files that are not used by any other program in the system. Moreover, files 30, 25, 17, 9, and 18 are used just by programs in this group. The second group contains programs 23, 28, and 26. Programs 23 and 28 use the same set of files. Program 26 uses a subset of the files that programs 23 and 28 use. In addition, file 28 is used only within this group. A similar analysis can be applied to the rest of the groups with the exception of the last group. The fifth group contains programs that could not be classified into the rest of the groups (i.e., programs 20 and 21). We also found this type of result when analyzing other software systems. Generally, the number of unclassified programs is small compared with the total number of programs in the system. Therefore, we consider these programs as exceptions, and do not consider group 5 as a subsystem.

As previously indicated, some files are used across groups. For example, file 26 is used in all subsystems, file 23 is used in three subsystems, and files 3 and 5 are used in two subsystems. These files could be seen as communication buffers or links among subsystems.

Figure 5. Grouping table

7. Conclusions

We have presented our initial work on using data mining techniques to recover designs of software systems. We proposed a general three-step method that can be used as a framework to apply data mining at different granularity levels and to produce different high level abstractions. In addition, we described an instantiation of this method, the ISA methodology, which decomposes a software system into data cohesive subsystems by mining association rules.

Our experience shows that data mining can be used to produce a logical decomposition of a software system. Data mining offers the advantage that it can identify data cohesive subsystems without any knowledge of the subject system. The only required source of information is the source code. Moreover, data mining is capable of producing meaningful results regardless of the size of the system. These properties of data mining make this approach especially appropriate to the analysis of large undocumented software systems. Furthermore, the approach has great potential for automation, as shown by the ISA methodology. Therefore, data mining has potential as a valuable tool for the reverse engineering and maintenance domains.

The next steps are the modification of the methodology to facilitate its automation and scalability, and the definition of a graphical model to represent the information generated by the methodology.

References

[1] E.J. Chikofsky, J.H. Cross II, Reverse Engineering and Design Recovery: A Taxonomy, IEEE Software, Vol. 7, No. 1, Jan. 1990, pp.
[2] G. Butler, P. Grogono, R. Shinghal, I. Tjandra, Retrieving Information from Data Flow Diagrams, in Proc. Second Working Conference on Reverse Engineering, IEEE Computer Society Press, July 1995, pp.
[3] J.C.S.P. Leite, P.M. Cerqueira, Recovering Business Rules from Structured Analysis Specifications, in Proc. Second Working Conference on Reverse Engineering, IEEE Computer Society Press, July 1995, pp.
[4] K. Lano, P.T. Breuer, H. Haughton, Reverse-engineering COBOL via Formal Methods, Journal of Software Maintenance: Research and Practice, Vol. 5, 1993, pp.
[5] A. Cimitile, A. De Lucia, M. Munro, Qualifying Reusable Functions Using Symbolic Execution, in Proc. Second Working Conference on Reverse Engineering, IEEE Computer Society Press, July 1995, pp.
[6] D.R. Harris, H.B. Reubenstein, A.S. Yeh, Reverse Engineering to the Architectural Level, in Proc. 17th Int'l Conf. on Software Engineering, IEEE Computer Society Press, 1995, pp.
[7] I. Jacobson, F. Lindström, Re-engineering of old systems to an object-oriented architecture, in Proceedings of OOPSLA, 1991, pp.
[8] P. Newcomb, G. Kotik, Reengineering Procedural Into Object-Oriented Systems, in Proc. Second Working Conference on Reverse Engineering, IEEE Computer Society Press, July 1995, pp.
[9] H. Gall, R. Klösch, Finding Objects in Procedural Programs: An Alternative Approach, in Proc. Second Working Conference on Reverse Engineering, IEEE Computer Society Press, July 1995, pp.
[10] H.M. Sneed, E. Nyáry, Extracting Object-Oriented Specification from Procedurally Oriented Programs, in Proc. Second Working Conference on Reverse Engineering, IEEE Computer Society Press, July 1995, pp.
[11] S.C. Choi, W. Scacchi, Extracting and Restructuring the Design of Large Systems, IEEE Software, Vol. 7, No. 1, Jan. 1990, pp.
[12] C. Rich, L.M. Wills, Recognizing a Program's Design: A Graph-Parsing Approach, IEEE Software, Vol. 7, No. 1, Jan. 1990, pp.
[13] H.M. Edwards, M. Munro, RECAST: Reverse Engineering from COBOL to SSADM Specification, in Proc. of the Working Conference on Reverse Engineering, IEEE Computer Society Press, 1993, pp.
[14] Yih-Farn Chen, M.Y. Nishimoto, C.V. Ramamoorthy, The C Information Abstraction System, IEEE Transactions on Software Engineering, Vol. 16, No. 3, Mar. 1990, pp.
[15] V. Narat, Using a relational database for software maintenance: a case study, in Proc. IEEE Conference on Software Maintenance CSM-93, IEEE Computer Society Press, 1993, pp.
[16] J.E. Grass, Object-Oriented Design Archaeology with CIA++, Computing Systems: The Journal of the USENIX Association, Vol. 5, No. 1, Winter 1992, pp.
[17] U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, The KDD Process for Extracting Useful Knowledge from Volumes of Data, Communications of the ACM, Vol. 39, No. 11, Nov. 1996, pp.
[18] U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, From Data Mining to Knowledge Discovery: An Overview, in U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy, eds., Advances in Knowledge Discovery and Data Mining, chapter 1, AAAI/MIT Press.
[19] R. Agrawal, T. Imielinski, A. Swami, Mining Association Rules between Sets of Items in Large Databases, Proc. ACM SIGMOD Int'l Conf. Management of Data, May 1993, pp.
[20] R. Agrawal, R. Srikant, Fast Algorithms for Mining Association Rules in Large Databases, Proc. 20th Int'l Conf. on Very Large Data Bases (VLDB 94), Sep. 1994, pp.
