Clomaint: a New Data Mining Algorithm in Maintenance Petroleum Plants

SETIT 2009, 5th International Conference: Sciences of Electronic, Technologies of Information and Telecommunications, March 22-26, 2009, TUNISIA

Salim KHIAT*, Sid Ahmed RAHAL* and Hafida BELBACHIR*
* Université Mohamed Boudiaf USTO, ORAN, ALGERIE
Salim_khiat@hotmail.com, h_belbach@yahoo.fr, Rahalsa2001@yahoo.fr

Abstract: Plant maintenance produces large amounts of repair data every day, and these data contain hidden and valuable knowledge. This knowledge can be used to reduce the cost of maintenance and the time of intervention and, in extreme cases, to save human lives. In this paper, we design an algorithm named CLOMAINT, based on the ideas of the pattern-growth method, for mining closed frequent itemsets (called patterns) in the SONATRACH maintenance database. First, we observe the features of the database and extract the attributes to be mined. Then we merge the items to form itemsets, and the algorithm is applied to the transformed database. CLOMAINT treats knowledge discovery as an interactive and iterative process whose aim is to optimize decision making. The quality of the discovered knowledge is evaluated. The experimental results show that the closed frequent patterns obtained are very useful and easy for plant maintenance operators to interpret.

Keywords: Closed frequent pattern, FP-tree, Knowledge, Maintenance database.

INTRODUCTION

Databases are a dormant potential resource that, when tapped, can produce substantial benefits. Data mining, also referred to as knowledge discovery in databases, is the non-trivial extraction of implicit, valid, previously unknown, potentially useful and ultimately understandable patterns from huge amounts of data, for use in making crucial decisions.
Among the research tasks in data mining, the extraction of association rules is undoubtedly the one that has drawn the most attention from researchers and on which the most work has been carried out. This technique allows the discovery of understandable and exploitable rules in a large volume of data; the rules express associations between items or attributes in a database. The purpose of association rule extraction is to discover significant relations between binary attributes extracted from databases. An example of an association rule extracted from a supermarket sales database is: cereals and sugar → milk (support 7%, confidence 50%). This rule indicates that customers who buy cereals and sugar also tend to buy milk. The support measures the range of the rule, i.e. the proportion of customers who bought all three articles, and the confidence measures the precision of the rule, i.e. the proportion of customers who bought milk among those who bought cereals and sugar. Extracting association rules consists in extracting the rules whose support and confidence are at least equal to the minimum support and confidence thresholds defined by the user.

The extraction of association rules is an iterative and interactive process made up of several phases, from the selection and preparation of the data to the interpretation of the results, by way of the knowledge search phase (extraction of the frequent sets of attributes and generation of the association rules). An association rule algorithm is usually divided into two sub-problems: frequent set mining from databases and association rule generation. Frequent sets lie at the basis of many data mining algorithms. As a result, hundreds of algorithms have been proposed to solve the first problem, the frequent set mining problem.
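The two measures used in the cereals, sugar and milk example can be computed directly from a list of transactions. The following is a minimal sketch, not code from the paper: the function name `rule_metrics` and the four baskets are invented for illustration, and the transaction list is assumed to be non-empty.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    antecedent, consequent = frozenset(antecedent), frozenset(consequent)
    both = antecedent | consequent
    # Transactions containing the left-hand side of the rule.
    n_antecedent = sum(1 for t in transactions if antecedent <= t)
    # Transactions containing every item of the rule.
    n_both = sum(1 for t in transactions if both <= t)
    support = n_both / len(transactions)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Hypothetical baskets, invented for illustration only.
baskets = [
    {"cereals", "sugar", "milk"},
    {"cereals", "sugar"},
    {"milk"},
    {"cereals", "milk"},
]
support, confidence = rule_metrics(baskets, {"cereals", "sugar"}, {"milk"})
print(support, confidence)  # 0.25 0.5
```

A rule is retained when both values reach the user-defined minimum thresholds.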

In classical frequent pattern mining algorithms [AGR 94], a huge number of frequent patterns are generated, many of which carry redundant information. To reduce these redundancies, one can require that the frequent patterns be closed. Closed frequent pattern mining [PAS 00] solves this problem by returning a succinct result from which the redundancies have been removed. The following scenario illustrates closed frequent patterns. Suppose the frequent patterns generated are: {bread, butter: 10}; {sugar, butter: 10}; {bread, sugar: 10}; {bread, sugar, butter: 10}. Closed frequent pattern mining will return one itemset only: {bread, sugar, butter: 10}. This itemset nevertheless carries the complete information about the frequency of its three sub-itemsets.

The remainder of the paper is organized as follows. In section 1, we introduce a new concept in the maintenance strategy of SONATRACH: Anticipatory Maintenance. In section 2, the problem of mining frequent closed itemsets is defined. In section 3, we present related work on extracting frequent closed itemsets. In section 4, we present the benefits of using the Patricia tree. Our new algorithm CLOMAINT is defined in section 5. Section 6 reports the experiments and some results. We conclude our work in section 7.

1. Anticipatory maintenance (AM)

Equipment failure is expensive. In aircraft, chemical and petroleum processing plants, weapon systems and early warning systems, a failure can be life threatening. Even when it is not life threatening, equipment failure frequently results in expensive downtime. SONATRACH adopts four strategies for equipment maintenance and repair [SON 93] (Fig. 1):

1.1. Curative maintenance (CU.M)

It is the most primitive and most spontaneous form of maintenance.
It consists in intervening after breakdowns or anomalies have appeared.

1.2. Preventative maintenance (PR.M)

It has long been used to replace parts at scheduled intervals. The PR.M approach is based primarily on lifetime estimates for particular parts, which are replaced at scheduled intervals before they exceed their estimated lifetime.

1.3. Predictive maintenance (PD.M)

It is the application of measurement techniques to equipment in service. These techniques make it possible to diagnose the state of the equipment in order to judge, on rational bases, whether to launch the preventative maintenance action or to defer it. It is based on the analysis of:
- State of the equipment (corrosion, leaks, cracks, anchoring, ...)
- Non-destructive tests (magnetic, ultrasound, liquid penetrant, thickness measurement, ...)
- Process parameters (temperature, pressure, flow, ...)
- Lubrication and sealing oils.
- Temperatures of the components of rotating machines.
- Vibrations and noises generated by the components of rotating machines.

1.4. Programmed stops (P.S)

It is one of the forms of maintenance most used in the hydrocarbon industry. It is closely tied to regulatory technical requirements and to safety. It mostly concerns equipment that cannot be maintained while in operation.

[FIGURE 1. GENERAL ARCHITECTURE. Labels: Curative Maintenance, Preventative Maintenance, Predictive Maintenance, Programmed Stops, Maintenance Database, Anticipatory Maintenance.]

We can note that databases play no part in the maintenance process. The data available in the maintenance database contain useful and hidden knowledge, but to date the analysis and interpretation of these data remain very difficult. It is therefore necessary to use powerful tools and techniques to extract knowledge that helps the company to develop. For that, recourse to data mining techniques is of primary importance.
We introduce a new concept in maintenance, Anticipatory Maintenance (A.M). It augments the traditional strategies (CU.M, PR.M, PD.M, P.S) by using data mining to predict parts that are likely to fail. A.M uses a data mining tool that finds affinities between repairs, or between reports and subsequent repairs. Quite simply, anticipatory maintenance is the marriage of data processing and industry. From the history of equipment breakdowns, one can extract relations or correlations between breakdowns that occur together. This makes it possible to reduce the cost of maintenance enormously. For example, suppose that each time joint X breaks down, compressor C breaks down. In this case a breakdown costing a few dollars generates breakdowns costing millions of dollars. The maintenance operator will therefore have to pay particular attention to article X, changing it regularly even before the preventive maintenance is due.

In what follows, we introduce a new algorithm, CLOMAINT, applied to plant maintenance. We first present related work on the algorithms used for mining closed frequent itemsets. We then propose the new algorithm CLOMAINT, which extracts closed frequent itemsets and generates association rules.

2. Problem definition

2.1. Definition: Frequent itemset

Let I = {x1, ..., xn} be a set of items. An itemset X is a subset of items, i.e., X ⊆ I. For the sake of brevity, an itemset X = {x1, x2, ..., xm} is also denoted as x1 x2 ... xm. A transaction T = (tid, X) is a 2-tuple, where tid is a transaction identifier (e.g. a customer identifier) and X is an itemset. A transaction T = (tid, X) is said to contain itemset Y if Y ⊆ X. A transaction database TDB is a set of transactions. The support of an itemset X, denoted supp(X), is the number of transactions in TDB that contain X. Given a transaction database TDB and a support threshold min_sup, an itemset X is a frequent pattern if supp(X) ≥ min_sup.

2.2. Definition: Closed frequent itemset

A frequent itemset X is said to be closed if there exists no itemset X' such that X' is a proper superset of X and every transaction containing X also contains X'. A closed itemset is frequent if it passes the given support threshold.

Knowledge about closed frequent patterns is interesting and useful when the right algorithm is used. We propose another approach to mining closed frequent patterns, which adopts the methodology of pattern-growth methods and avoids the candidate generate-and-test approach. A pattern-growth method uses the Apriori property; however, instead of generating candidate sets, it recursively partitions the database into sub-databases according to the frequent patterns found, and searches for local frequent patterns to assemble longer global ones [JIN 00].

In this paper, we design an algorithm named CLOMAINT, based on the idea of the pattern-growth method, with the completeness property. We use this algorithm to find closed frequent patterns effectively in an industrial database. Closed frequent itemsets form a small subset of the frequent itemsets, but they represent exactly the same knowledge in a more succinct way. From the set of closed itemsets, it is in fact straightforward to derive both the identities and the supports of all frequent itemsets. Mining the closed itemsets is semantically equivalent to mining all frequent itemsets, but with the great advantage that closed itemsets are fewer than frequent ones.

We are only interested in the database generated by one plant of SONATRACH. Table 1 shows an example of selected real data, where N_DEMANDE identifies the number of the intervention to repair equipment, Date_ETA is the date of the intervention, EQUIPEMENT is the code of the equipment to be repaired, DESC_EQUI is the name of the equipment, CODE_ARTI is the code of the article to be repaired, DESI_ARTI is its name, and CLASSE_ARTI identifies the class of the article.

TABLE 1. AN EXAMPLE OF SELECTED REPAIR DATA

3. Related works

Describing all the algorithms for extracting closed frequent patterns is almost impossible given their large number. However, all these algorithms rest on two approaches:
- The "generate and test" approach, represented by the algorithms CLOSE [BAS 00] and A-CLOSE [PAS 00].
- The "divide and conquer" approach, represented by the algorithm CLOSET [JIA 00].
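Whichever approach is used, the expected output is fixed by the definitions of Section 2. As a reference point, the brute-force sketch below applies those definitions literally: it enumerates every candidate itemset, counts its support, and keeps the frequent itemsets that have no proper superset with the same support. It is not CLOSE, A-CLOSE, CLOSET or CLOMAINT, and it is only practical for toy inputs. The function names are illustrative assumptions; the data are the six transactions of the sample dataset D (Table 2), with an absolute threshold min_sup = 2.

```python
from itertools import combinations


def support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def frequent_itemsets(transactions, min_sup):
    """Map each frequent itemset to its support (exhaustive enumeration)."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            count = support(candidate, transactions)
            if count >= min_sup:
                result[candidate] = count
    return result


def closed_itemsets(frequent):
    """Keep the itemsets with no proper superset of equal support.

    A superset with the same support is itself frequent, so it is
    enough to look for it among the frequent itemsets.
    """
    return {
        x: s for x, s in frequent.items()
        if not any(x < y and s == frequent[y] for y in frequent)
    }


# Sample dataset D of Table 2 (TIDs 1 to 6).
D = [frozenset(t) for t in ("ACD", "BCE", "ABCE", "BE", "ABCE", "BCE")]
frequent = frequent_itemsets(D, min_sup=2)
closed = closed_itemsets(frequent)
print(len(frequent), len(closed))  # 15 5
for itemset, count in sorted(closed.items(), key=lambda kv: sorted(kv[0])):
    print("".join(sorted(itemset)), count)
```

On this dataset the five closed itemsets (C, AC, BE, BCE and ABCE) summarize the fifteen frequent ones: the support of any frequent itemset is the support of its smallest closed superset.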
Close and A-Close were the first algorithms for closed itemset mining based on the Apriori heuristic. A-Close is a variation of Apriori: it adopts the Apriori framework, but looks for frequent closed itemsets and prunes the frequent itemsets that are not closed. The major cost of A-Close comes from two aspects: (1) it has to generate a lot of candidates and scan the transaction database again and again to count them; and (2) in the last scan, which computes closures, there can be a large number of surviving frequent itemsets. For each transaction, the intersection with each surviving frequent itemset is computed, which makes the closure computation quite costly.

The authors of FP-growth proposed CLOSET for mining closed frequent patterns. This algorithm inherits from FP-growth the compact FP-Tree data structure and the exploration technique based on recursive conditional projections of the FP-Tree. Frequent single items are detected in a first scan of the dataset and, in another scan, the pruned transactions are inserted into the FP-Tree stored in main memory. Despite the efficiency of FP-growth, if the database is huge, the FP-tree will be large and the space required for the recursion is a challenge [JIN 00].

CLOSE uses a breadth-first search to find the closed frequent itemsets. On a dense database or with very long itemsets (which is the case for our databases), breadth-first search can be inefficient because many candidates may be generated and several passes over the database are necessary. CLOSET builds a tree of frequent patterns in an FP-Tree structure and uses several optimizations to prune the search space. CLOSET outperforms the other algorithms in extraction time, especially on dense data (industrial databases). For this reason, we propose the algorithm CLOMAINT as an improvement of CLOSET. CLOSET has two main disadvantages:
1- Multiple scans of the database (about four scans).
2- The FP-Tree may not fit in main memory.
CLOMAINT needs two scans of the database and uses a compressed Patricia tree to store the dataset, which provides a space-efficient representation for dense datasets. Indeed, because it has fewer nodes than the standard FP-tree, the Patricia tree has lower space requirements, especially for dense datasets.

4. Representing the dataset as a Patricia tree

In this section, we compare the Patricia-Tree structure with the FP-Tree in order to determine which is better suited to different databases (dense or sparse).

4.1. FP-Tree [JIN 00]

The FP-Tree structure consists of a set of prefix subtrees under a root node labeled null, and a header table containing the frequent items. Every header table entry points to a node in the FP-Tree carrying the same item name, and every node in the FP-Tree points to the next occurrence of its item.

4.2. Patricia-Tree [PIE 03]

A Patricia-Tree is a compressed FP-Tree.
We keep the same representation as an FP-Tree, but we merge every parent node with its single child node when they have the same support value. Unlike an FP-Tree node, which represents a single item, a Patricia-Tree node can represent several items.

TID  Items
1    A C D
2    B C E
3    A B C E
4    B E
5    A B C E
6    B C E
TABLE 2. SAMPLE DATASET D

[FIGURE 2. D REPRESENTED BY FP-TREE]
[FIGURE 3. D REPRESENTED BY PATRICIA-TREE]

For these reasons, we propose to adapt the Patricia-Tree proposed in [PIE 03] to generate frequent closed itemsets. For example, consider the dataset of Table 2 with minsup set to 2/6. Its representations as an FP-Tree and as a Patricia-Tree are given in Fig. 2 and Fig. 3 respectively. The Patricia tree has five nodes where the FP-tree has seven. The figures show that a Patricia-Tree is more compact than an FP-Tree.

5. Algorithm design and implementation

This section presents the complete description of the algorithm.
Algorithm: Mining closed frequent patterns integrating the advantages of the pattern growth methods.
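The procedure described in this section begins by building a Patricia tree for the database. The sketch below shows one possible way to do this on the dataset of Table 2: it first builds an FP-tree, then merges every node with its single child when both have the same count. It is an illustration under stated assumptions, not the authors' Java implementation: the class and function names are invented, the header table and node links are omitted, and ties between equally frequent items are broken alphabetically (an assumption that happens to reproduce the seven and five nodes quoted in Section 4).

```python
from collections import Counter


class Node:
    def __init__(self, items, count=0):
        self.items = list(items)  # one item in an FP-tree, several after merging
        self.count = count
        self.children = {}        # first item of the child -> child node


def build_fp_tree(transactions, min_count):
    """Insert each transaction, restricted to frequent items, into a prefix tree."""
    counts = Counter(item for t in transactions for item in t)
    frequent = {i for i, c in counts.items() if c >= min_count}
    root = Node([])
    for t in transactions:
        node = root
        # Descending frequency, alphabetical tie-break (assumption).
        for item in sorted((i for i in t if i in frequent),
                           key=lambda i: (-counts[i], i)):
            node = node.children.setdefault(item, Node([item]))
            node.count += 1
    return root


def compress(node):
    """Merge, in place, each node with its only child when counts are equal."""
    for child in node.children.values():
        while len(child.children) == 1:
            (only,) = child.children.values()
            if only.count != child.count:
                break
            child.items += only.items
            child.children = only.children
        compress(child)
    return node


def count_nodes(node):
    """Number of nodes below `node` (the root itself is not counted)."""
    return sum(1 + count_nodes(c) for c in node.children.values())


D = [set("ACD"), set("BCE"), set("ABCE"), set("BE"), set("ABCE"), set("BCE")]
fp_tree = build_fp_tree(D, min_count=2)   # minsup 2/6 as an absolute count
fp_nodes = count_nodes(fp_tree)
patricia_nodes = count_nodes(compress(fp_tree))
print(fp_nodes, patricia_nodes)  # 7 5
```

The two merges performed here are C with A (count 1) and C with E (count 4); B keeps two children and is left unchanged.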

CLOMAINT Algorithm
Input: Transaction database D, which consists of a set of items, and the support threshold minsup.
Output: Closed frequent itemsets.
Method:
1- Initialization: let FCI be the set of closed frequent itemsets. FCI ← ∅
2- Find closed frequent itemsets recursively: call CLOMAINT_SUB(∅, D, FCI)

Procedure CLOMAINT_SUB(X, DB, FCI)
Parameters and variables:
- X: the frequent itemset if DB is an X-conditional database, else X = ∅ if DB is D
- DB: conditional database
- F_list: list of frequent items of the header table of the Patricia tree.
- FCI: closed frequent itemsets found.
Method:
1- Build the Patricia tree for DB. // Optimization 1
2- If a single branch is attached to the root, let Y be the node attached to the root on that branch; insert X ∪ Y into FCI if X ∪ Y is not a subset of an itemset in FCI with the same support. // Optimization 2
3- Apply optimization 3 to extract closed frequent itemsets directly if possible.
4- For each remaining item i in the header table, starting with the last, call CLOMAINT_SUB(iX, DB_i, FCI) if iX is not a subset of any frequent closed itemset already found with the same support count, where DB_i is the i-conditional database with respect to DB. // Optimization 4

Optimization 1: Compress transactional and conditional databases using the Patricia tree structure
The Patricia tree is a prefix tree structure that represents compressed but complete frequent itemset information for a database. Its construction is simple: transactions with the same prefix share a portion of a path from the root. Similarly, conditional Patricia trees can be constructed for conditional databases. Using the Patricia tree in the computation of closed itemsets has the following benefits.
- The Patricia tree compresses databases for frequent itemset mining. Transactions sharing a common prefix path of a branch of the tree do not create any new nodes for that prefix. Moreover, the deeper the recursion in the construction of conditional databases, the better the chance of sharing and the more compact the conditional Patricia tree.
- Conditional databases can be derived from the Patricia tree efficiently. Since the Patricia tree may compress multiple transactions into one path, the projection of this path is equivalent to a scan of multiple transactions.

Optimization 2: Extract items appearing in every transaction of the conditional database
If a single branch with node Y is attached to the root of the X-conditional database, insert X ∪ Y into FCI if X ∪ Y is not a subset of an itemset in FCI with the same support count.

Optimization 3: Directly extract frequent closed itemsets from the Patricia tree
If there is a single prefix path in a Patricia tree, some frequent closed itemsets can be extracted directly from the conditional database.

Optimization 4: Prune search branches
Let X and Y be two frequent itemsets with the same support. If X ⊂ Y and Y is a closed itemset, then there is no need to search the X-conditional database, because there is no hope of generating frequent closed itemsets from it.

After extracting the set of closed frequent itemsets, we use the confidence measure to generate the association rules, as in the Apriori algorithm [AGR 94].

6. Experimental results

In this section we evaluate the correctness of the closed frequent equipment repair patterns discovered by CLOMAINT, and of the rules generated. Without loss of generality, we select an 8-year-old maintenance database of size [...] records and 7 columns, tested with different minimum supports in order to show the flexibility of the algorithm. All tests were performed on a 3 GHz Pentium IV PC with 256 MB RAM and a 40 GB HDD running Microsoft Windows XP. CLOMAINT is written in Java, and the data are stored in an ORACLE 8i database table.

6.1. Experimental

We compare the space requirements of the Patricia tree and the FP-tree for different minimum supports.
We note that the space requirement of the Patricia tree is lower than that of the FP-tree for every minimum support tested.

[FIGURE 4. SPACE REQUIREMENT OF THE TWO STRUCTURES: PATRICIA TREE AND FP-TREE. Vertical axis: space requirement (bytes). Horizontal axis: support, with ticks including 0.8%, 0.3%, 0.1%, 0.05% and 0.03%.]

6.2. Results

Below are some examples of the mining results obtained by running CLOMAINT on the maintenance database. The results take the following form:

Rule 1: Each time we replace the article «INSULATING ADHESIVE TAPE VYNIL NOIR.LARGEUR 19. 3M SCOTCH 33» of equipment «MOTO-PUMP CONDENSATE», in 67% of cases we also replace the article «VYNIL ADHESIVE TAPE INSULATOR BLACK SIZE 19 mm 23 3M SCOTCH» of equipment «MOTO-PUMP REFLUX TOWER OF WASHING».

Rule 2: Each time we replace the article «INSULATING ADHESIVE TAPE VYNIL NOIR.LARGEUR 19. 3M SCOTCH 33» of equipment «MOTO-PUMP CONDENSATE», in 22% of cases we also replace the article «TEFLON ROLL OF m C & TROUVAY PTFE TAPE» of equipment «BOILERS HP».

7. Conclusion

In this paper, we devised a procedure for mining closed frequent patterns from the maintenance database provided by one plant of SONATRACH. First, by observing the features of the database, the data cleaning procedure collects the attributes to be mined. After data cleaning, the items found in these attributes are merged into a set of sequences. By integrating the advantages of previous pattern-growth algorithms, we designed CLOMAINT to discover useful closed frequent repair patterns. The results obtained help maintenance operators make crucial decisions. In the future, we will introduce temporal association rules to find the chronological order of equipment repairs and failures in the distributed databases of the plants.

REFERENCES

[AGR 94]: R. AGRAWAL, R. SRIKANT «Fast Algorithms for Mining Association Rules». In VLDB 94, pp.
[BAS 00]: Yves BASTIDE, Nicolas PASQUIER, Rafik TAOUIL and Lotfi LAKHAL «Efficient Mining of association rules using closed itemset lattices». Laboratoire d'Informatique (LIMOS), Université Blaise Pascal Clermont-Ferrand II. Information Systems Vol. 24 No. 1, pp.
[JIA 00]: H. JIAWI, J. PEI and R. MAO «CLOSET: An efficient Algorithm for mining frequent closed itemsets». In SIGMOD Int'l Workshop on Data Mining and Knowledge Discovery, May
[JIN 00]: P. JINA, H. JIAWI, Y. YIWEN, M. RUNYING «Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach». Data Mining and Knowledge Discovery, 8:53-87, 2000
[PAS 00]: Nicolas PASQUIER «Data Mining : Algorithmes d'extraction et réduction des règles d'association dans les bases de données». Thèse de doctorat, Janvier, Université Clermont-Ferrand II
[PIE 03]: A. PIETRACAPRINA and D. ZANDOLIN «Mining Frequent Itemsets using Patricia Tries». Department of Information Engineering, University of Padova
[SON 93]: SONATRACH «Approche des politiques maintenance des complexes LTG» 22/05/
