Microarray gene expression data association rules mining based on BSC-tree and FIS-tree


Data & Knowledge Engineering 53 (2005)

Xiang-Rong Jiang (a), Le Gruenwald (b,*)
(a) College of Pharmacy, University of South Carolina, Columbia, SC 29201, USA
(b) School of Computer Science, The University of Oklahoma, Norman, OK 73019, USA

Received 22 June 2004; accepted 22 June 2004; available online 30 July 2004

Abstract

In this paper we propose to use association rules to mine the association relationships among different genes under the same experimental conditions. These kinds of relations may also exist across many different experiments with various experimental conditions. A new approach, called FIS-tree mining, is proposed for mining microarray data. Our approach uses two new data structures, the BSC-tree and the FIS-tree, together with a data partition format for gene expression level data. Based on these two data structures it is possible to mine association rules efficiently and quickly from a gene expression database. Our algorithm was tested using two real-life gene expression databases available at Stanford University and Harvard Medical School, and was shown to perform better than the two existing algorithms, Apriori and FP-Growth. © 2004 Elsevier B.V. All rights reserved.

Keywords: Association rule mining; DNA microarray; Gene expression

* Corresponding author. E-mail address: ggruenwald@ou.edu (L. Gruenwald).

1. Introduction

1.1. Problem statement

DNA (deoxyribonucleic acid) microarrays [25,27] enable scientists to study an entire genome's expression under a variety of conditions. The advent of DNA microarrays has facilitated a fundamental shift from gene-centric science to genome-centric science [5,6]. With several eukaryotic genomes completed and the draft human genome published [30], we are now entering the post-genomic age. The main focus in genomic research is switching from sequencing to using the genome sequences in order to understand how genomes function. Some questions we would like to ask are the following: What are the functional roles of different genes? In what cellular processes do genes participate? How are genes regulated? How do genes and gene products interact, and what are these interaction networks? How does gene expression level differ in various cell types and states? How is gene expression changed by various diseases or compound treatments?

With the tremendous increase of gene expression data collected by microarray technology, it is possible to answer these questions. However, one question raised is how we can analyze these data quickly and efficiently, because the traditional methods that biologists have employed to process and interpret their biological data are not suitable for the huge amount of DNA microarray data. As Brown wrote in [7]: "Perhaps the greatest challenge now is to develop efficient methods for organizing, distributing, interpreting, and extracting insights from the large volumes of data these experiments will provide."

With the development of data mining methods and software, it is possible to analyze DNA microarray data. Data mining is an information extraction activity, the goal of which is to discover hidden facts contained in databases. Using a combination of machine learning, statistical analysis, modeling techniques and database technology, data mining finds patterns and subtle relationships in data and infers rules that allow the prediction of future results. The knowledge found in DNA microarray data by data mining may answer the questions mentioned above.

So far many data mining methods have been used to mine gene expression data, such as clustering [4,8] and classification [18,9]. However, these methods mainly focus on gene expression profiles, which are the sets of expression values for a single gene across many experimental conditions. An example is the clustering of gene expression data that groups unknown genes with known genes in the same cluster and provides clues to their functions [16,3,10,19,17,26]. This is based on the hypothesis that genes that have similar gene expression trends under various conditions may have similar functions.

In this paper, we propose using association rules to mine association relationships, such as one gene being the regulator of another gene, among different genes under the same experimental condition. For example, in the database in [29], we may find that gene TUP1 is the regulator of gene
YDR533C. These kinds of relations may also exist across many different experiments with various experimental conditions. This type of information is very important for answering the questions posed previously. It will help us understand gene regulation, metabolic and signaling pathways, and gene regulatory networks.

When applying an association rule mining algorithm to microarray gene expression data, the following characteristics must be taken into consideration:

The large search space. A microarray gene expression database consists of the data obtained from many microarray slides under various experimental conditions. Each microarray slide can be considered as one database transaction containing the values of genes in one experimental condition, and each gene can be considered as one data item. For human beings there are 50,000-100,000 genes. There would be a tremendous number of candidate itemsets that must be identified by an association rule mining algorithm. For such an algorithm to work effectively, it must be able to deal robustly with the dimensionality of this feature space.

Uninteresting genes. Not all genes are interesting to biologists. Sometimes biologists may be interested only in some special genes. So they may just want to mine the association rules among these interesting genes and not waste time mining all other genes' possible association rules.

Data normalization. Due to technical limitations, the constant of proportionality between the actual number of mRNA [20] samples per cell and the relative amount measured by a microarray experiment is unknown, and varies across microarray experiments. This variance introduces noise into experiments and requires that we normalize microarray data by an appropriate factor.

The existing association rule mining works [1,28,11,21,22] do not use datasets similar to microarray gene expression data and do not consider all the above characteristics of microarray gene expression data, even though they perform well when analyzing other data. The objective of this research is to develop an efficient association rule mining algorithm to analyze microarray gene expression data by taking all of these characteristics into consideration. This paper presents the proposed algorithm, called FIS-tree mining, and a performance evaluation comparing FIS-tree mining with the existing algorithms using two real-life microarray gene expression databases from Stanford University [29] and Harvard Medical School [12].

The rest of the paper is organized as follows. Section 2 provides some background information on association rule mining and reviews the three existing association rule mining algorithms, Apriori [1,28], FP-Growth [11] and P-tree [21,22]. Section 3 describes the proposed algorithm, FIS-tree mining. Section 4 presents the performance evaluation. Finally, Section 5 concludes the paper and discusses future research.

2. Related work

Association rule mining (ARM) is a widely used technique for large-scale data mining. Originally proposed for market basket data to study consumer-purchasing patterns in retail stores, it has potential applications in many areas. Microarray data is one of the promising application
areas. Very complex and highly interlinked data such as a spot in a microarray slide provide information not only about its intensity of expression but also about its interaction with other genes. Extracting interesting patterns and rules from a microarray database can be important in identifying gene regulation pathways, where the expression of certain genes depends on the expression of other genes. It would also be possible to extract unknown genes, which can have significant biological implications.

Here let us briefly introduce ARM as defined in [2]. For a database D, let T = {x1, x2, ..., xn} be the set of distinct literals (or data items) in D. Each transaction in D is a subset of T, and X, Y ⊆ T denote itemsets. An association rule is an expression of the form X ⇒ Y. The intuitive meaning of such a rule is that in the transactions of database D where the attributes in X have a value of true, the attributes in Y also have a value of true with high probability. There are primarily two measures of quality for each rule, support and confidence. The support (s%) is the ratio (in percent) of the number of records that contain X ∪ Y to the total number of records in database D. The confidence (c%) is the ratio (in percent) of the number of records that contain X ∪ Y to the number of records that contain X in database D. The goal of association rule mining is to find all the rules with support and confidence exceeding some user-specified support and confidence thresholds. The problem is thus to mine for rules that satisfy user-specified minimum support and minimum confidence. There can be hundreds of thousands of association rules in a given data set, depending on its size and complexity. The process of mining for such rules is called association rule mining (ARM). For example, much of genomic research requires the mining of microarray data for rules, implications, class structures and cluster structures [14]. An association rule such as TUP1 ⇒ YDR343C could mean that when gene TUP1's expression level increases, it is likely that gene YDR343C's expression level also increases.

ARM has two important steps: one is to find all frequent itemsets, and the other is to mine association rules from these frequent itemsets (an itemset X is called frequent if its support is equal to or greater than the user-defined minimum support). Usually the latter step is straightforward, so the main challenge is to find the frequent itemsets. For practical applications, looking at all subsets of T is doomed to failure due to the huge search space. There exist many association rule mining algorithms [1,28,21,22,13,23]. In this section, we review the three algorithms Apriori [1,28], FP-growth [11], and P-tree [21,22], for the following reasons:

The Apriori algorithm is well known and has been employed in biological data analysis [14]. Also, the Apriori algorithm has two important properties. One is that every subset of a frequent itemset is also frequent. The other is that every superset of an infrequent itemset is also infrequent. Based on these two properties we can save much search space and execution time.

The FP-growth algorithm uses a novel frequent pattern tree (FP-tree) structure. A large database can be compressed into this highly condensed structure, which avoids costly, repeated database scans.
Also, FP-growth uses a frequent pattern growth method to avoid the costly generation of a large number of candidate itemsets.

The P-tree (Peano Count Tree) algorithm provides a fast way to calculate the support for each itemset by using the root count of each itemset's P-tree.
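To make the support and confidence measures above concrete, here is a minimal C++ sketch (ours, not from the paper; the gene names and the six transactions are illustrative) that counts the support and confidence of a candidate rule X ⇒ Y by brute force:

    #include <iostream>
    #include <set>
    #include <string>
    #include <vector>

    typedef std::set<std::string> Itemset;

    // Number of transactions that contain every item of the given itemset.
    int countSupport(const std::vector<Itemset>& db, const Itemset& items) {
        int count = 0;
        for (const Itemset& t : db) {
            bool containsAll = true;
            for (const std::string& item : items)
                if (t.find(item) == t.end()) { containsAll = false; break; }
            if (containsAll) ++count;
        }
        return count;
    }

    int main() {
        // Each transaction = the genes whose expression increased on one slide.
        std::vector<Itemset> db = {
            {"TUP1", "YDR343C", "G3"}, {"TUP1", "YDR343C"}, {"TUP1"},
            {"YDR343C", "G3"},         {"TUP1", "YDR343C"}, {"G3"}};

        Itemset x  = {"TUP1"};
        Itemset xy = {"TUP1", "YDR343C"};

        // support(X => Y) = |X u Y| / |D|; confidence(X => Y) = |X u Y| / |X|.
        double support    = 100.0 * countSupport(db, xy) / db.size();
        double confidence = 100.0 * countSupport(db, xy) / countSupport(db, x);

        std::cout << "support(TUP1 => YDR343C)    = " << support    << "%\n"
                  << "confidence(TUP1 => YDR343C) = " << confidence << "%\n";
    }

With these six transactions, TUP1 ⇒ YDR343C has support 3/6 = 50% and confidence 3/4 = 75%.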

2.1. The Apriori algorithm [1]

The Apriori algorithm was proposed by Agrawal et al. in 1994 [1]. It is an important association rule mining algorithm, and the first algorithm in which the search space is reduced significantly. It assumes that items within an itemset are kept in lexicographic order. The Apriori algorithm is based on the two important properties of frequent itemsets. One is that every subset of a frequent itemset is also frequent. The algorithm makes use of this property in the following way. The count of an itemset (i.e., the number of transactions which contain this itemset in the database D) is not needed if one of its subsets is not frequent. So, we can first find the counts of short itemsets in one pass over the database. Then we consider longer and longer itemsets in subsequent passes. When we consider a long itemset, we can be sure that all its subsets are frequent. The other property is that every superset of an infrequent itemset is also infrequent.

The Apriori algorithm begins by generating the 1-item frequent itemsets (i.e., the frequent itemsets each member of which contains only one data item) by scanning the entire database D. Next, in each pass, Apriori performs the two following operations: storing all potential candidates in a hash tree denoted as C, and scanning the entire database so that the support of each candidate can be counted. If a candidate's support is greater than or equal to the minimum support, it is put in the frequent itemset. The k-item candidates are generated by joining the (k-1)-item frequent itemsets (L_{k-1}) with themselves in all possible ways. The algorithm terminates when L_k is empty.

The Apriori algorithm uses the property that every subset of a frequent itemset is also frequent. By only considering frequent itemsets of the previous pass, the number of candidate frequent itemsets is significantly reduced. It is generally acknowledged that the Apriori algorithm works well in terms of reducing the candidate itemsets [1]. However, gene expression databases contain many items (genes); for example, there are more than 5000 genes on one microarray slide, i.e., in one transaction. So, many candidate itemsets must still be generated, which requires repeated scans of the entire database to calculate each itemset's support. This costs a lot of memory space and execution time.

2.2. The FP-growth algorithm (without candidate generation) [11]

The FP-growth algorithm is based on three principles: (1) it uses a novel frequent pattern tree (FP-tree) structure; a large database is compressed into this highly condensed, smaller data structure, which avoids costly, repeated database scans; (2) it uses a frequent pattern growth method to avoid the costly generation of a large number of candidate itemsets; and (3) it uses a partitioning-based, divide-and-conquer method to decompose the mining task into a set of smaller tasks for mining confined patterns in the conditional pattern base, which dramatically reduces the search space.

The FP-tree mining algorithm consists of two algorithms: one constructs the FP-tree, and the other, called FP-growth, derives frequent itemsets from the constructed FP-tree. The FP-tree construction algorithm works as follows. First, get the frequency of each item by scanning the whole database D. Second, sort the items of each transaction in frequency-descending order. Third, construct an FP-tree, which has the following structure:
1. It consists of one root labeled as null, a set of item prefix subtrees as the children of the root, and a frequent item header table.
2. Each node in the item prefix subtree consists of three fields: item-name, count, and node-link, where item-name registers which item this node represents, count registers the number of transactions represented by the portion of the path reaching this node, and node-link links this node to the next node in the FP-tree carrying the same item-name, or null if there is none.

3. Each entry in the frequent-item header table consists of two fields, item-name and head of node-link, which points to the first node in the FP-tree carrying the item-name.

To build an FP-tree, first create the root of the FP-tree and label it as null. Then use the function called insert_tree to build the FP-tree.

The FP-growth algorithm works as follows. An FP-tree-based pattern fragment growth mining method is developed, which starts from a frequent length-1 pattern (as an initial suffix pattern) and examines only its conditional pattern base (a sub-database which consists of the set of frequent items co-occurring with the suffix pattern). It then constructs its conditional FP-tree and performs mining recursively with such a tree. The pattern growth is achieved via concatenation of the suffix pattern with the new patterns generated from a conditional FP-tree.

There are several advantages of the FP-growth algorithm. First, it uses the FP-tree, which is usually smaller than the original database, and thus it saves the costly database scans in the subsequent mining processes. Second, it applies a pattern growth method which avoids costly candidate generation and tests by successively concatenating frequent 1-itemsets found in the conditional FP-trees. Third, it applies a partitioning-based divide-and-conquer method which dramatically reduces the size of the subsequent conditional pattern bases and conditional FP-trees. The major disadvantage is that when a database is large it is not feasible to construct a main-memory-based FP-tree. This is due to the fact that many conditional pattern bases and conditional FP-trees are generated by this algorithm. For gene microarrays, each slide has thousands of genes (corresponding to items in each transaction of a transactional database). Applying the FP-growth algorithm to microarray gene expression data will produce many conditional pattern bases and conditional FP-trees. Eventually it will use up all memory space. So it is not a good algorithm for mining microarray gene expression data.

2.3. The P-tree algorithm [21,22]

In the P-tree (Peano Count Tree) algorithm, the bsq data format (or bsq file) is used to represent spatial data. In the bsq format, bits are stored in a file in bit-sequential order. Each bit file is represented in a P-tree, from which a data cube is constructed. P-trees are considered data-mining-ready data structures, as all the needed itemset counts are pre-computed and stored in these structures. The P-tree algorithm works as follows:

1. The bsq (bit sequential) files are formed by storing the bits that come from the partition of the original data in bit-sequential order. For example, a gene expression level can be represented as a byte (8 bits). Each byte for one gene under one condition can be broken into 8 bits. Then each bit at the same position in the bytes of the gene expression levels is stored in one bsq file. In total, 8 bsq files (because each byte has 8 bits) are formed for one gene under different conditions.
2. P-trees are built from the bsq files. A P-tree is a quadrant-based tree. The fan-out of a P-tree needs to be 4 or a power of 4. P-trees can be generated by recursively dividing the entire data in a bsq file into quadrants and recording the count of 1-bits for each quadrant. A P-tree is somewhat similar to a Quadtree and its variants [24]. The root of a P-tree contains the number of 1-bits within the entire bsq file. The next level of the tree contains the 1-bit counts of the four quadrants in raster order. At the next level, each quadrant is partitioned into sub-quadrants, and their 1-bit counts in raster order constitute the children of the quadrant node. This construction is continued recursively down each branch of the tree until the sub-quadrant is pure (i.e., it contains only 1-bits or only 0-bits), which may or may not be at the leaf level.

3. Find the 1-item frequent itemsets by using the following formula:

support = (the root count of the P-tree) / (the total number of transactions in the database)

where the root count of the P-tree is the number of 1-bits within the entire bsq file. If an itemset's support is equal to or larger than the minimum support, then this item is a 1-item frequent itemset.

4. Join L_{k-1} with itself to get the k-item candidates. AND their P-trees to get the root count of the resulting P-tree. ANDing two P-trees means performing the logical AND operation on the bit type (0 or 1) at the leaf nodes of the two P-trees. Calculate the support by the formula in Step 3 and store the k-item frequent itemsets. Repeat this step until L_k is empty.

There are advantages to this algorithm. The bsq format facilitates better compression through the creation of an efficient, rich data structure called the P-tree, and accommodates pruning based on a one-bit-at-a-time approach. The P-tree structure is a data-mining-ready structure for association rule mining. It provides a fast way to calculate support and confidence for association rule mining. However, there are some disadvantages. For example, the P-tree is a kind of quadrant tree: its fan-out can only be 4 or a power of 4. In microarray gene expression data, the bsq files are usually of various sizes (not necessarily a power of 4). So efficiently building P-trees for these kinds of files, whose sizes may not be a power of 4, is a major problem.

3. The proposed FIS-tree mining algorithm

In this section, a new algorithm, FIS-tree (Frequent ItemSet-tree) mining, is proposed for mining microarray data. It considers the characteristics of microarray gene expression data discussed in Section 1. It attempts to incorporate all the advantages of the three association rule mining algorithms reviewed in Section 2 and, at the same time, remove their disadvantages. It uses a data format for gene expression data where each value can be represented by a sign bit, fraction bits and exponent bits, and each bit at the same position in the data partition format can be organized into a bit string compression tree (BSC-tree). The BSC-tree originates from our real-time data compression project [15]. A BSC-tree is a kind of real-time compression of a bit string, and can be built on the fly. In this research, additional functions for the BSC-tree are developed in
order to mine association rules from a database. The BSC-tree is used as a data-mining-ready data structure. Based on this structure it is possible to calculate the support and confidence for each itemset quickly. Then another data structure, the FIS-tree, is proposed for finding and storing all possible frequent itemsets. With the data partition format and the two new data structures, BSC-tree and FIS-tree, it is possible to mine association rules efficiently from a gene expression database.

The overall FIS-tree mining algorithm can be summarized as follows. First, each gene's expression value in the microarray database is represented as a bit string. Second, a BSC-tree is built for each gene by reading the bits from the gene's bit string. Third, the support of each 1-item itemset is computed by using the counting information stored at the root (called the root count) of each of these BSC-trees. 1-item itemsets are also called 1-itemsets; each member of these itemsets contains only one gene. Fourth, perform the logical AND operations on individual BSC-trees to obtain the resulting ANDed BSC-trees, and use the root counts of these resulting trees to compute the support values of the k-item itemsets. Fifth, identify the frequent itemsets, store them in an FIS-tree, and process the FIS-tree to derive the association rules that meet the minimum confidence requirement.

In this section, we use the simplified microarray gene expression database shown in Table 1 to illustrate our proposed algorithm. In this simplified database, Gj represents a gene's name, and each slide represents one microarray slide under one experimental condition. Each gene whose value is greater than some standard point of comparison, such as the Timepoint-1 value used in [29], is said to have an increasing value and is normalized to 1; otherwise it is normalized to zero. There are in total eight genes and six slides in this database. One of the mining tasks that we are interested in is to find rules A ⇒ B, which indicate that if A's expression value increases then it is likely that B's expression value also increases, where A and B are sets of genes among G1-G8. To solve this task, we are interested in genes that have the value 1 in the database.

Table 1. A sample microarray gene expression database: genes G1-G8 (columns) by Slides 1-6 (rows), with each entry the normalized 0/1 expression value.

3.1. Data partition

There are various kinds of data on which we can perform data mining to get useful information. For gene expression data, usually normalized data are used for gene expression levels, such as a positive fraction, a negative fraction or zero. Here we propose to use the data format shown in Fig. 1 to represent genes' expression values in our BSC-tree data structure. This format is the same as the computer representation of floating point numbers. Any number can be represented as a fraction and an exponent. Then the fraction can be represented as one bit for the ± sign following
n bits for the fraction, and the exponent can be represented as one bit for the ± sign following m bits for the exponent.

Fig. 1. Data partition format: one ± sign bit and n fraction bits, followed by one ± sign bit and m exponent bits.

How many bits to use in our data format depends on the precision we want. For example, using 1 bit in our data format, we can compare the gene expression value with some standard point of comparison to identify whether the gene value increases or decreases (1 can be used to represent "increase" and 0 "decrease"). If we use 2 bits in our data format, then by checking the left bit we can identify whether the gene expression value increases or decreases, and by checking the right bit we can identify whether the gene expression value increases two times or more.

There are several advantages to using this data partition format. First, it can represent any number precisely. Second, different bits have different degrees of contribution to the intensity value; in some applications, some of these bits give us enough information, so we do not need all the bits. Third, this format facilitates better data compression (see the BSC-tree presentation in Section 3.2). Fourth, and most importantly, this format facilitates the creation of an efficient, rich data structure called the BSC-tree, which supports our FIS-tree mining algorithm. Some of these advantages are very useful when mining association rules from microarray gene expression data. For example, if we just want to find the gene expression profiles for the genes whose expression levels increase, we just need to consider the sign bit (suppose that 1 stands for an increased gene expression level and 0 for no change or a decreased gene expression level, as in the example database in Table 1). That saves memory space and execution time. In the rest of this paper we use only 1 bit for each gene in our example, but this data format can be used with any number of bits, such as 8 bits for one gene.

3.2. A BSC-tree data structure

A BSC-tree is a bit string compression tree and is used as a data-mining-ready data structure. One BSC-tree is built for each gene whose expression data have been collected under various experimental conditions and represented as a bit string (for example, from Table 1, gene G1 is represented as one six-bit string and gene G2 as another, where each bit in the string represents the increase/decrease of the expression value of the corresponding gene in a microarray slide). In each BSC-tree, the sequence order of each bit, which comes from the partition of the gene expression data collected under various experimental conditions, is fixed. So it is easy to apply the AND, OR, XOR, ... logical operations to the BSC-trees of genes in order to build a BSC-tree representing multiple genes (k-item itemsets). The BSC-tree will be used in our FIS-tree algorithm to find k-item frequent itemsets (also called frequent k-itemsets), as discussed in Section 3.5.

In a BSC-tree, each node has the following format:

node-level | bit-type | 1-bit-count

Here node-level is a virtual level value used to build the BSC-tree. The bit-type is 1 for all-1-bit nodes, 0 for all-0-bit nodes and m for mixed 1/0 nodes. 1-bit-count shows the number of
1-bits in the subtree or the entire tree (the 1-bit-count will be used to calculate the support of each gene in the FIS-tree mining algorithm presented in Section 3.5 to identify genes whose expression values increase). Depending on the optimum data compression, we can choose one of the BSC_n-trees (n = 2, 3, 4, 5, 6, ..., where n is the number of child nodes of each node and n < the maximum number of items in a transaction). There are several ways to build the BSC-tree [15]; here we show one of them. We build the BSC-tree in a bottom-up manner. For example, suppose we want to build a BSC-tree in which each node has two children (i.e., a BSC_2-tree) to represent a gene from its bit string. First, our algorithm creates each node with 2^(L-1) bits at virtual level L (initially L = 1). Second, if there are two nodes at the same virtual level, the merge function Merger() is invoked to merge these two lower virtual level L nodes (subtrees) into a new higher virtual level (L = L + 1) node (subtree). If the bit types of these two lower virtual level L nodes (subtrees) are the same, we do not need to keep these lower virtual level nodes (sub-subtrees) under the new higher virtual level node (subtree), so the new node (subtree) is a leaf node of this branch of the tree; otherwise, we keep these lower virtual level nodes (sub-subtrees) as the child nodes of the new higher virtual level node (subtree). Third, we similarly continue to create a new node with 2^(L-1) bits at virtual level L (now L = L + 1). During the course of creating the new node, if the bit type of the node differs from the bit type of the bit currently read from the bit string representing the gene, then the node is broken down into lower virtual level nodes (L = L - 1). This is carried out until a node with a uniform bit type is formed (node size = 2^(L-1) bits). If there are two nodes at the same virtual level, the algorithm repeats the Merger() function. Fourth, after all expression data for one gene are read, if there are two nodes at the same virtual level, the algorithm repeats the Merger() function; otherwise, the lower virtual level node is upgraded to the next higher virtual level (L = L + 1), and if there are then two nodes at the same virtual level, the algorithm repeats the Merger() function. This process continues until the BSC-tree root node is formed. The 1-bit count at the root node (called the root count) tells us the number of 1-bits that the gene has. This procedure is applied to all genes. The pseudo code for building the BSC-tree is shown in Fig. 2, and Fig. 3 shows the resulting BSC-trees representing genes G1, G3 and G6 from Table 1.

Fig. 2. The algorithm for building a BSC-tree.

Fig. 3. A is the BSC-tree of G1; B is the BSC-tree of G3; C is the BSC-tree of G6.

3.3. The BSC-tree ANDing algorithm

In this section we describe the BSC-tree ANDing algorithm, which performs the logical AND operations on the BSC-trees of individual genes to create combined BSC-trees representing sets of items each of which contains k genes (called k-item itemsets). This operation is called ANDing BSC-trees. The root counts of the resulting BSC-trees will be used to calculate the support values of the k-item itemsets. Here we give a simple example of performing the logical AND operation on the bit types (0 or 1) at the leaf nodes of two BSC-trees. Suppose that we have two BSC-trees, BSC-tree_1 and BSC-tree_2, representing genes Gx and Gy (Fig. 4).

Fig. 4. Two sample BSC-trees.

Step 1. Mark each left path in a BSC-tree as 0 and each right path as 1. Then we get each 1-bit leaf node's path code, such as those shown in the rectangles in Fig. 5.

Fig. 5. Getting the path codes from BSC-trees.
Step 2. Store the 1-bit-type leaf nodes' path codes with their 1-bit counts for each BSC-tree in an array for ANDing. The root count of a BSC-tree is the sum of the 1-bit counts of all 1-bit leaf nodes of the BSC-tree. For example, in Figs. 6 and 7, the 1-bit nodes' path codes and the 1-bit counts of all 1-bit leaf nodes of BSC-tree_1 and BSC-tree_2 are stored in array_1 and array_2, respectively.

Fig. 6. Array_1 stores the 1-bit nodes' path codes and the 1-bit counts of all 1-bit leaf nodes of BSC-tree_1.

Fig. 7. Array_2 stores the 1-bit nodes' path codes and the 1-bit counts of all 1-bit leaf nodes of BSC-tree_2.

In our proposed FIS-tree mining algorithm, we only need to know the root counts of BSC-trees to calculate the support values of itemsets. The root count of a BSC-tree is the sum of the 1-bit counts of all its 1-bit leaf nodes. So in our FIS-tree mining algorithm, a BSC-tree is stored in the form of the array described above. This is a major advantage of the FIS-tree mining algorithm, because it compresses the original database twice to save memory space: the first time, the original database is compressed into a BSC-tree, and the second time, the BSC-tree is compressed into an array of 1-bit leaf nodes' path codes with their 1-bit counts.

Step 3. AND the BSC-trees by using the arrays described above.

In our algorithm, any bit string starting with one zero (0) is called a subcode of 0, any bit string starting with two zeros (00) is called a subcode of 00, and so on. Any bit string starting with 1 is called
a subcode of 1, any bit string starting with 11 is called a subcode of 11, and so on; likewise, any bit string starting with 01 is called a subcode of 01, any bit string starting with 10 is called a subcode of 10, and so on. For convenience, to compare path codes, we generally compare them bit by bit from left to right, with 1 considered larger than 0. For example, the path code 11 is larger than any path code of the form 10.... In this ANDing BSC-tree algorithm we also define that the path code 11 is larger than
the path code 101 or 1001, because in a BSC-tree the shorter path code, such as 11, is always at a higher virtual level than a longer path code, such as 1001. In the ANDing BSC-tree algorithm we always look for sub-path-codes (subcodes) and store them in a resulting array with the corresponding 1-bit counts. This is because by Step 3 we already have the two arrays of 1-bit leaf nodes' path codes, array_1 and array_2: if a path-code-1 in array_1 is a subcode of a path-code-2 in array_2, then after ANDing, the resulting path code is equal to path-code-1. The pseudo code for the ANDing BSC-trees algorithm using path code arrays is given in Fig. 8; it is a modification of the algorithm reported in [21,22]. Applying this algorithm to the two arrays in Figs. 6 and 7, we obtain array_3, shown in Fig. 9, which represents the 1-bit nodes' path codes of all 1-bit leaf nodes of the ANDed BSC-tree of the two BSC-trees in Figs. 6 and 7. This shows that the 1-bit count at the root node (also called the root count) of the resulting ANDed BSC-tree representing the 2-itemset GxGy is 1 + 2 = 3.

3.4. FIS-tree

In the FIS-tree mining algorithm, we propose a new structure called the FIS-tree (Frequent-ItemSets tree) to find all possible k-item frequent itemsets efficiently. In this algorithm the candidate itemsets are generated on the fly and do not need to be stored. With ANDing of BSC-trees, the FIS-tree grows from the 1-item frequent itemsets to the k-item frequent itemsets. Finally, all possible (1 to k)-item frequent itemsets are stored in one FIS-tree. Before building the FIS-tree, one BSC-tree is built, as shown earlier, for each gene. The FIS-tree algorithm uses the root count of each BSC-tree to calculate the support of each gene. In our system, each microarray slide is one transaction and each gene on the microarray is one item. In the BSC-tree for gene Gi, the root count is the number of transactions in which gene Gi's expression level increases. So, the support of Gi is calculated as follows:

support = (the root count of the BSC-tree) / (the total number of transactions in the database)

To build an FIS-tree, first create the root node, find all 1-item frequent itemsets (those with support ≥ minimum support), and build the FIS-tree of the 1-item frequent itemsets at level 1. Second, for level k, start from the root node of the FIS-tree and check all nodes up to level k - 1 (k = 2 to m, where m is equal to the number of 1-item frequent itemsets, because the longest possible frequent itemset is composed of all 1-item frequent itemsets). Perform the AND operation on the BSC-trees of all nodes from the root node to a node, say X, at level k - 1 to get the resulting ANDed BSC-tree, say T1; then AND T1 with the BSC-tree of each of node X's siblings on the right side (the same process has already been done on the siblings on the left side) to get the new ANDed BSC-trees. This process is the same as joining L_{k-1} with L_{k-1} to get the k-itemset candidates C_k in the Apriori algorithm [1]. At the same time, the support values of C_k can be calculated using the root counts of these new ANDed BSC-trees. If a support is greater than or equal to the minimum support (which means this C_k is a frequent k-itemset L_k), then one new subnode (with the same name as that of this sibling of X) is generated under node X; this means that L_k is formed and stored in the FIS-tree. This process is applied to every level's nodes recursively until the entire FIS-tree is built.
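Since the root count of a gene's BSC-tree is simply the number of 1-bits in the gene's bit string, the 1-itemset support computation above reduces to a population count. A minimal illustrative C++ sketch follows (the bit string is invented, and the paper reads the count off the tree root instead of rescanning the bits):

    #include <bitset>
    #include <iostream>

    int main() {
        const int kSlides = 6;            // transactions in the example database
        std::bitset<6> g1("110111");      // illustrative: five slides show an increase

        // Root count of G1's BSC-tree = number of 1-bits in its bit string.
        double support = double(g1.count()) / kSlides;
        std::cout << "root count = " << g1.count()       // 5
                  << ", support = " << support << "\n";  // 5/6 = 0.833...
    }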

Fig. 8. The algorithm for ANDing BSC-trees (modified from [21,22]).
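The paper's pseudo code for this merge is Fig. 8 above; the following is a minimal C++ sketch of our reading of Step 3: each BSC-tree is stored as an array of (path code, 1-bit count) pairs for its pure-1 leaves, and ANDing keeps a leaf whenever its path code is a subcode of (i.e., starts with) a leaf code from the other tree. The array contents here are illustrative, not the values of Figs. 6 and 7:

    #include <iostream>
    #include <string>
    #include <utility>
    #include <vector>

    // A BSC-tree compressed to its pure-1 leaves: (path code, 1-bit count) pairs.
    typedef std::vector<std::pair<std::string, int> > PathArray;

    // 'code' is a subcode of 'prefix' if it starts with 'prefix', i.e. the deeper
    // leaf lies inside the pure-1 region named by the shorter path code.
    bool isSubcode(const std::string& code, const std::string& prefix) {
        return code.size() >= prefix.size() &&
               code.compare(0, prefix.size(), prefix) == 0;
    }

    // AND two BSC-trees given as path-code arrays: a leaf survives whenever its
    // path code is a subcode of a leaf code from the other tree (cf. Fig. 8).
    PathArray andTrees(const PathArray& a, const PathArray& b) {
        PathArray result;
        for (std::size_t i = 0; i < a.size(); ++i)
            for (std::size_t j = 0; j < b.size(); ++j) {
                if (a[i].first.size() >= b[j].first.size() &&
                    isSubcode(a[i].first, b[j].first))
                    result.push_back(a[i]);   // a's leaf lies inside b's 1-region
                else if (isSubcode(b[j].first, a[i].first))
                    result.push_back(b[j]);   // b's leaf lies inside a's 1-region
            }
        return result;
    }

    int main() {
        // Illustrative arrays (not the values of Figs. 6 and 7).
        PathArray array1, array2;
        array1.push_back(std::make_pair("01", 2));   // pure-1 leaf covering 2 bits
        array1.push_back(std::make_pair("110", 1));
        array1.push_back(std::make_pair("111", 1));
        array2.push_back(std::make_pair("0", 4));    // pure-1 leaf covering 4 bits
        array2.push_back(std::make_pair("111", 1));

        PathArray array3 = andTrees(array1, array2);
        int rootCount = 0;                           // support numerator for GxGy
        for (std::size_t i = 0; i < array3.size(); ++i) {
            std::cout << array3[i].first << " -> " << array3[i].second << "\n";
            rootCount += array3[i].second;
        }
        std::cout << "root count of the ANDed tree = " << rootCount << "\n";  // 3
    }

With these inputs, the leaves 01 (inside 0's region) and 111 (equal codes) survive, giving a root count of 1 + 2 = 3 without ever materializing the full bit strings.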

Fig. 9. Array_3 stores the 1-bit nodes' path codes of the 1-bit leaf nodes of the resulting ANDed BSC-tree of the two BSC-trees in Figs. 6 and 7.

The second step is a very important step. In this step we have taken advantage of the Apriori algorithm [1]. These advantages, combined with the characteristics of microarray gene expression data, ensure that our FIS-tree algorithm works correctly. The reasons are as follows. First, on a microarray slide each gene's position is fixed; the order of genes on every slide is the same. So we just use this order and do not need to rearrange it as in the Apriori algorithm. Second, in our FIS-tree algorithm we use the properties that are used in the Apriori algorithm (e.g., every subset of a frequent itemset is also frequent, and every superset of an infrequent itemset is also infrequent). We use the same method as the Apriori algorithm to generate the candidate itemsets, so we will never lose any possible candidate frequent itemsets [1]. But with the FIS-tree algorithm we do not need to store these candidate frequent itemsets and do not need to scan the entire database to get each candidate frequent itemset's count. Instead, we generate the candidate frequent itemsets on the fly and calculate their support from the root counts of the BSC-trees. After calculating each candidate frequent itemset's support, all frequent itemsets are stored in a compressed data structure, the FIS-tree. Fig. 10 shows the algorithm for building an FIS-tree.

Fig. 10. The algorithm for building an FIS-tree.

As an example, let us use the database D shown in Table 1 and suppose that the user-defined minimum support is 50% (minsup = 50%). First, the basic BSC-trees are built. Then the root counts are used to calculate the supports: G1's BSC-tree root count is 5, and the BSC-tree root counts of G2, G4, G7 and G8 are 3, so their supports are ≥ minsup. Therefore G1, G2, G4, G7 and G8 are the frequent 1-itemsets at FIS-tree level 1 (Fig. 11). To build the frequent 2-itemsets, combine each leaf node at level 1 with its siblings on the right side under the same parent node, such as G1G2, G1G4, G1G7, G1G8, and then get the ANDed BSC-trees. The root counts of the ANDed BSC-trees of G1G2, G1G4 and G1G7 are 3; their support values are equal to minsup (0.5), so they are frequent 2-itemsets (level 2 in Fig. 11). The root count of G1G8's ANDed BSC-tree is 2; its support < minsup, so G1G8 is an infrequent 2-itemset and is not in the FIS-tree. Follow the same procedure for each node at level 1 to generate its level 2 sub-nodes. In the same way, the nodes at level 3 (frequent 3-itemsets) are generated from the level 2 nodes. For example, starting from the root, perform a depth-first search to the leftmost node (e.g., root → G1 → G2), build the ANDed BSC-tree of G1G2, say T, then AND T with the basic BSC-tree of each sibling (G4, G7) on the right side under the same parent node to get the ANDed BSC-trees of G1G2G4 and G1G2G7. Because their root counts are 1, their support < minsup, so no sub-node is formed. Then we back up to G2's nearest parent node (G1) to search the next leftmost node (G4). Similarly, we get G1G4G7. The root count of G1G4G7's BSC-tree is 3 and its support = minsup, so a new sub-node G7 is formed under G4. In this way each node can generate its next-level nodes
until the entire FIS-tree is completed. This is a very efficient way to generate all frequent k-itemsets. In Fig. 11, all the frequent itemsets, G1, G2, G4, G7, G8, G1G2, G1G4, G1G7, G2G8, G4G7, and G1G4G7, are stored in one FIS-tree efficiently.
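A minimal C++ sketch of this growth scheme follows (our illustration: FisNode, grow and the bit strings are made up, with plain bitsets standing in for ANDed BSC-trees, since Table 1's entries are not reproduced above). Each node's bitset is the AND of the bit strings along its root path, and children are generated only from right siblings, mirroring the Apriori join:

    #include <bitset>
    #include <iostream>
    #include <memory>
    #include <string>
    #include <vector>

    const int kSlides = 6;                 // number of transactions (slides)
    typedef std::bitset<6> Bits;           // stands in for a (ANDed) BSC-tree

    struct FisNode {                       // one FIS-tree node = one gene
        int gene;                          // index into the gene table
        Bits bits;                         // AND of all bit strings on the root path
        std::vector<std::unique_ptr<FisNode> > kids;
    };

    // Grow the next level under 'node': AND its path bitset with each right
    // sibling's base bitset and keep the child only if support >= minsup.
    void grow(FisNode* node, const std::vector<FisNode*>& rightSiblings,
              const std::vector<Bits>& base, double minsup) {
        for (std::size_t i = 0; i < rightSiblings.size(); ++i) {
            Bits anded = node->bits & base[rightSiblings[i]->gene];
            if (double(anded.count()) / kSlides >= minsup)
                node->kids.push_back(std::unique_ptr<FisNode>(
                    new FisNode{rightSiblings[i]->gene, anded, {}}));
        }
        std::vector<FisNode*> kids;
        for (std::size_t i = 0; i < node->kids.size(); ++i)
            kids.push_back(node->kids[i].get());
        for (std::size_t i = 0; i < kids.size(); ++i)   // recurse with right siblings
            grow(kids[i], std::vector<FisNode*>(kids.begin() + i + 1, kids.end()),
                 base, minsup);
    }

    void print(const FisNode* n, const std::string& prefix) {
        for (std::size_t i = 0; i < n->kids.size(); ++i) {
            const FisNode* k = n->kids[i].get();
            std::string itemset = prefix + "G" + std::to_string(k->gene + 1);
            std::cout << itemset << " (" << k->bits.count() << "/" << kSlides << ")\n";
            print(k, itemset);
        }
    }

    int main() {
        // Illustrative stand-ins for Table 1's 0/1 rows (one bit per slide).
        std::vector<Bits> base = {Bits("110111"), Bits("010111"), Bits("100100"),
                                  Bits("110110"), Bits("001001"), Bits("010010"),
                                  Bits("110110"), Bits("011001")};
        const double minsup = 0.5;

        FisNode root{-1, Bits().set(), {}};            // all-1 bits: AND identity
        for (int g = 0; g < int(base.size()); ++g)     // level 1 = frequent 1-itemsets
            if (double(base[g].count()) / kSlides >= minsup)
                root.kids.push_back(std::unique_ptr<FisNode>(
                    new FisNode{g, base[g], {}}));

        std::vector<FisNode*> level1;
        for (std::size_t i = 0; i < root.kids.size(); ++i)
            level1.push_back(root.kids[i].get());
        for (std::size_t i = 0; i < level1.size(); ++i)
            grow(level1[i], std::vector<FisNode*>(level1.begin() + i + 1, level1.end()),
                 base, minsup);

        print(&root, "");                  // every printed path = a frequent itemset
    }

Every root-to-node path printed by this sketch is a frequent itemset, so the tree needs no separate candidate storage, matching the on-the-fly candidate generation described above.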

Fig. 11. FIS-tree. Level 0: root. Level 1: G1, G2, G4, G7, G8. Level 2: G2, G4, G7 (under G1), G8 (under G2), G7 (under G4). Level 3: G7 (under G1 → G4).

3.5. Deriving association rules from a FIS-tree

Deriving association rules from frequent itemsets is a straightforward process. Here we modify the known algorithm reported in [1] to derive the association rules from an FIS-tree. Let D be a database, |D| be the total number of transactions in D, and I be the set of all items in D; for an itemset X ⊆ I, |X| is the number of transactions in D that contain X. The algorithm for discovering association rules is shown in Fig. 12; it is a modification of the algorithm proposed in [1].

Fig. 12. The algorithm for deriving association rules from frequent itemsets.

For example, G1G2 is a frequent 2-itemset in the FIS-tree. Suppose that minsup is 50% and the user-defined minimum confidence, minconf, is 50%. For G1G2, support = 3/6 = 50%; for G1 in Fig. 3(A), support = 5/6 = 83.3%. So confidence = |G1 ∪ G2| / |G1| = 3/5 = 60% > minconf, and the rule G1 ⇒ G2 holds.

Microarray data are good examples of datasets on which to mine association rules. The red/green log-ratio values from the spots (genes) on the microarray are converted into bit strings in our data partition format. Each bit string is then converted into a basic BSC-tree, through which data compression is applied. An FIS-tree is then built from these BSC-trees for all genes. The FIS-tree can be successfully used to derive the association rules from the microarray data. These rules can provide valuable information to biologists concerning gene regulatory pathways and can identify important relationships between different gene expression patterns.

4. Performance evaluation

In this section we introduce the two real-life microarray gene expression datasets which we use to measure the execution time of our proposed algorithm and of the two other existing association rule mining algorithms, Apriori and FP-growth. We then present our comparison results. We did not implement the P-tree association rule mining algorithm because its detailed algorithm is not available to the public (the algorithm has been patented by its authors).

4.1. Two real-life databases

To test our proposed algorithm, we use two real-life microarray gene expression datasets, denoted DB1 (Table 2) and DB2 (Table 3), from Stanford University [29] and Harvard Medical School [12], respectively.

In Table 2, ORF (open reading frame) represents a putative gene, Name represents the gene function name (if no name is under this entry, then the gene's function is unknown), Gi means that the microarray slide is scanned by a green laser beam, and Ri means that the microarray slide
is scanned by a red laser beam. This database collected 7 groups of data at 7 time points (represented as Timepoints 1-7; in Table 2 only Timepoint 1 is shown). G1 is the data read from the gene spots on the microarray slide at Timepoint 1 by using a green laser beam to scan the slide. R1 is the data read from the gene spots on the microarray slide at
Timepoint 1 by using a red laser beam to scan the slide. G1.Bkg is the data read from the microarray slide background at Timepoint 1 by using a green laser beam, and R1.Bkg is the data read from the microarray slide background at Timepoint 1 by using a red laser beam. F1 refers to the user flag, where any number other than zero indicates a problem with that array spot; this can occur when dust or particles interfere with or overlay an array element.

Table 2. A portion of DB1 (columns: ORF, Name, G1, G1.Bkg, R1, R1.Bkg, F1; one row per ORF, e.g. YHR007C, YAL051W, YAL055W, YAL056W, YAL058W, YOL109W, YAL065C, YAL066W).

Table 3. A portion of DB2 (columns: ORF, SGDID, OrigORFNames, Cho_0, Cho_10, Cho_20, Cho_30, Cho_40; one row per ORF, e.g. YAL004W through YAL014C, with normalized expression values such as 5e-005).

In DB1 each Timepoint (one microarray slide) corresponds to one transaction. There are seven transactions in DB1, and in each transaction there are 6150 genes. The size of each numeric item in DB1 is 4 bytes. Before these data are used by our association rule mining program, they are normalized, for the following reason. Gene expression is the process by which mRNAs, and eventually proteins, are synthesized from the DNA template of each gene. Due to technical limitations, the constant of proportionality between the actual number of mRNA samples per cell and the relative amount measured by a microarray experiment is unknown, and varies across microarray experiments. This variance has a negative influence on the data analysis process, such as the clustering process [4,8], and requires that we normalize microarray data by multiplying the array results by an appropriate factor; therefore we use the normalized data in order to discover more accurate results. We normalize the data in DB1 as follows. First, we get the ratio (r) of the red value to the green value for each spot on the microarray slide:

r = (R - R.Bkg) / (G - G.Bkg)
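A minimal C++ sketch of this normalization for a single spot (the intensity values are invented; the log ratio against the Timepoint-1 control r1 is the formula given next):

    #include <cmath>
    #include <iostream>

    // One spot's raw readings: red/green foreground and background intensities.
    struct Spot { double R, RBkg, G, GBkg; };

    // Background-corrected red/green ratio r = (R - R.Bkg) / (G - G.Bkg).
    double ratio(const Spot& s) { return (s.R - s.RBkg) / (s.G - s.GBkg); }

    int main() {
        Spot timepoint1 = {1520, 320, 980, 290};  // invented control readings
        Spot timepoint4 = {2890, 310, 760, 280};  // invented later-timepoint readings

        double r1 = ratio(timepoint1);            // Timepoint-1 control ratio
        double logRatio = std::log(ratio(timepoint4) / r1);

        // Ratio > 0: expression increased; < 0: decreased; = 0: no change.
        std::cout << "Ratio = " << logRatio << "\n";
    }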

We choose Timepoint-1 as the control (a standard point of comparison). Using the next formula we get the log ratio of each gene expression value at each time point:

Ratio = log(r / r1)

where r1 is the Timepoint-1 ratio (r1 = (R1 - R1.Bkg) / (G1 - G1.Bkg)). In this way, if a gene expression level increases, its Ratio will be larger than zero; if it decreases, its Ratio will be less than zero; otherwise, its Ratio will be equal to zero (no change).

For DB2, we use the normalized version. In Table 3, ORF is the putative gene symbol, SGDID is an ID number for each gene in this database, and OrigORFNames is the gene's original name from its originally published paper. Cho_0, Cho_10, ..., Cho_40 represent the normalized gene expression values under different experimental conditions. In DB2 each experimental condition (one microarray slide) corresponds to one transaction. There are 213 transactions in DB2, and in each transaction there are 6400 genes. The size of each numeric item in DB2 is 4 bytes.

4.2. Experimental results

The experiments were performed on a Sun Ultra 5 workstation (SPARC-IIi, clocked at 360 MHz). All programs for FIS-tree, Apriori and FP-growth are written in the C++ language. The execution time used in this section means the total execution time, i.e., the period between the input of the original data and the output of the frequent itemsets. We compare the effects of the support threshold, the number of transactions and the number of frequent 2-itemsets on the execution time incurred by Apriori, FP-growth and FIS-tree on DB1 and DB2.

The support threshold is a very important parameter in the association rule mining process. For an itemset, its support is the percentage of the number of transactions that contain the itemset over the total number of transactions in a database. This parameter has a big effect on the execution time, because the number of frequent itemsets changes dramatically when the support threshold changes. The number of transactions represents the size of a database, so this parameter is used to test the scalability of the three algorithms. The reason that we choose the number of frequent 2-itemsets as a parameter is as follows. In biologists' labs, one approach to defining putative gene regulators is to use the DNA microarray to identify genes whose expression is affected by mutations in each putative regulatory gene (each time only one putative gene regulator is deleted). Here we may use association rule mining to mimic this process. For this kind of approach, we can find all frequent 2-itemsets and then derive association rules of the form: the putative gene regulator ⇒ the affected gene. In this way we do not waste time finding other, longer frequent itemsets for association rule mining.

4.2.1. Varying (minimum) support threshold

In this testing case, the support threshold is varied from 10% to 100%. In DB1 we consider the genes whose expression levels are twice (or more) that of the control. Based on our experience dealing with microarray gene expression data in biological labs, we believe that a gene's expression level increases significantly if its expression level is twice (or more) that of the control. In biological chemistry, we can consider two statistical data values to be significantly different when one
value is two times or more greater or less than the other. For the same reason we choose the genes whose expression levels are larger than 0.001 in DB2.

As shown in Figs. 13 and 14, for the datasets DB1 and DB2 at the higher support threshold values (in Fig. 13, support threshold ≥ 0.4, and in Fig. 14, support threshold ≥ 0.5), the FP-growth and FIS-tree mining algorithms run very fast, and no significant difference between FIS-tree and FP-growth is observed. But at the lower support threshold values there are significant differences between our FIS-tree mining algorithm and the other two algorithms. At low support threshold values FIS-tree mining runs faster than FP-growth and Apriori (in Fig. 13, support threshold ≤ 0.3; in Fig. 14, for FIS-tree vs. FP-growth, support threshold ≤ 0.4, and for FIS-tree vs. Apriori, support threshold ≤ 0.5). The reason is that for each itemset its support is the percentage of the number of transactions that contain the itemset over the total number of transactions in the database. A lower support threshold means that fewer transactions need to contain an itemset for it to be frequent, while a higher support threshold means that more transactions must contain it. So the higher the support threshold is, the fewer frequent itemsets are obtained (see the numbers of frequent itemsets shown in Figs. 13 and 14). At a higher support threshold, the number of frequent itemsets is small, so even the Apriori and FP-growth algorithms do not need much time to generate the small number of frequent itemsets. They all run fast, and no significant difference is observed between them and the FIS-tree mining algorithm.

Fig. 13. Execution time of the algorithms vs. support threshold for expression level > 2 times the control in the real-life dataset DB1 with 7 transactions.

Fig. 14. Execution time of the algorithms vs. support threshold for expression level > 0.001 in the real-life dataset DB2 with 100 transactions.

For similar reasons to those mentioned above, the lower the support threshold is, the more frequent itemsets are obtained (see the numbers of frequent itemsets shown in Figs. 13 and 14). Thus FP-growth produces more conditional pattern bases and more conditional FP-trees for generating the larger number of frequent itemsets, which costs more memory space and execution time. At a lower support threshold, our FIS-tree mining algorithm runs much faster than the Apriori algorithm. The reason is that, for a low support threshold, the number of candidate itemsets is extremely high. For each candidate itemset, Apriori needs to scan the entire database once to calculate its support, so its performance in generating frequent itemsets degrades dramatically. For our FIS-tree mining algorithm, however, all support values can be calculated on the fly by ANDing BSC-trees. In particular, our FIS-tree mining algorithm does not need to wait until all bits in the input bit string are read before starting to build the BSC-tree. This provides a fast way to calculate the support for association rule mining. Similarly, in Fig. 15, in generating frequent 2-itemsets, when the support threshold is 0.1 our FIS-tree mining algorithm is two times faster than the FP-growth algorithm. However, when the support threshold is equal to or larger than 0.3, no significant difference between our FIS-tree mining algorithm and the FP-growth algorithm is observed.

4.2.2. Varying number of transactions

To test the scalability of the algorithms in terms of the number of transactions, experiments on DB2 were performed. The support threshold is set to 40%. In DB2 we choose a gene whose expression level is higher than 0.001 as a gene whose expression level increases significantly. The number of transactions is varied from 20 to 100. The results are shown in Fig. 16 (DB1 contains only seven transactions, so we do not use it in these experiments).

Fig. 15. Execution time of the algorithms for generating frequent 2-itemsets vs. support for expression level > 2 times the control in the real-life dataset DB1.

Fig. 16. Execution time of the algorithms vs. the number of transactions for expression level > 0.001 in the real-life dataset DB2 (support = 0.4).

As can be seen in Fig. 16, our FIS-tree mining algorithm is more scalable than the Apriori algorithm and the FP-growth algorithm. Under the experimental conditions shown in Fig. 16, as the number of transactions increases, FIS-tree is faster than FP-growth and much faster than Apriori. The reason is that the number of transactions represents the size of the database. We know that FP-growth needs to scan the entire database twice to build the FP-tree, and Apriori needs to scan the entire database each time it calculates an itemset's support. For FIS-tree mining, however, the following reasons make it more scalable than the others. First, FIS-tree mining only needs to scan the entire database once. Second, the BSC-tree is a compression tree, and the original database is highly compressed into a small data structure. Third, two rounds of compression are achieved in FIS-tree mining: the first time, the original database is compressed into a BSC-tree, and the second time, the BSC-tree is compressed into an array of 1-bit leaf nodes' path codes with their 1-bit counts. These two rounds of compression are good for saving memory. Fourth, an FIS-tree is also a compression tree: in an FIS-tree, each branch is a longest frequent itemset, all sub-frequent itemsets with the same prefix are included in this longest frequent itemset, and no extra memory is needed to store them. Based on these advantages, the larger the database, the higher the compression ratio. So our FIS-tree mining algorithm is more scalable than FP-growth and Apriori.

4.2.3. Varying number of frequent 2-itemsets

The reason that we choose the number of frequent 2-itemsets as a parameter is to try to mimic the process which biologists use to define putative gene regulators from DNA microarray data. Experiments on DB1 were executed. The support threshold is set to 10%. In DB1 we consider a gene whose expression level is twice (or more) that of the control gene's expression level, for the same reason mentioned earlier. The results of the execution time vs. the number of frequent 2-itemsets in DB1 are shown in Fig. 17.

As the number of frequent 2-itemsets increases, significant differences are observed between our FIS-tree mining algorithm and both the FP-growth and Apriori algorithms. When the number of frequent 2-itemsets is 500,000, our FIS-tree mining algorithm is 800 times faster than the Apriori algorithm and nearly two times faster than the FP-growth algorithm. From the FP-growth algorithm we know that to generate a large number of frequent itemsets, FP-growth produces more conditional pattern bases and more conditional FP-trees, which costs memory space and execution time. For the Apriori algorithm to generate a large number of frequent itemsets, it needs to scan the entire database more times. For our FIS-tree mining algorithm, all support values can be calculated on the fly by ANDing BSC-trees. That is why FIS-tree is much faster than the other two when the number of frequent 2-itemsets increases.

Fig. 17. Execution time of the algorithms for generating frequent 2-itemsets for expression level > 2 times the control in the real-life database DB1 (support = 0.1).

Conclusions and future research

In this paper, we proposed a new association rule mining algorithm, the FIS-tree mining algorithm, that makes use of a bit string data partition format and two new data structures, the BSC-tree and the FIS-tree. The FIS-tree mining algorithm takes the characteristics of microarray gene expression data into consideration. A BSC-tree is a compression tree; it can be built on the fly for each gene to compute frequent 1-itemsets. Frequent 2- to n-itemsets are computed by performing logical AND operations on individual BSC-trees. A BSC-tree provides a fast way to calculate support and confidence for association rule mining using only one database scan. An FIS-tree, also a compression tree, is built to store all frequent itemsets: each branch is a longest frequent itemset, all sub-frequent itemsets with the same prefix are included in this longest frequent itemset, and no extra memory is needed to store them. The FIS-tree mining algorithm then processes the FIS-tree to derive the desired association rules.

The paper then presented experimental results comparing the proposed FIS-tree mining algorithm with two existing association rule mining algorithms, Apriori [1] and FP-Growth [11], in terms of execution time. The experiments used two real-life microarray gene expression datasets from Stanford University [29] and Harvard Medical School [12]. Three sets of experiments were performed to study the effects of the support threshold, the number of transactions, and the number of frequent 2-itemsets on the execution time of each algorithm. The results showed that the FIS-tree mining algorithm performs the best, FP-Growth the second best, and Apriori the worst in all three sets of experiments. The FIS-tree mining algorithm performs significantly better than the other two algorithms when the support threshold is low, when the database size in terms of the number of transactions is large, or when the number of frequent 2-itemsets is high.

In this study, we applied the association rule mining algorithms to one data mining task: deriving association rules indicating that if the expression value of a gene A (or of a set of genes) increases, it is likely that the expression value of a gene B (or of a set of genes) also increases. However, the proposed FIS-tree mining algorithm is general. In the future we will therefore extend our work to other data mining tasks, such as finding gene regulation and gene network information from microarray gene expression data. We will also study the case where one is interested in finding not only increasing but also decreasing trends of gene expression; mining such rules requires only a small change to our FIS-tree mining algorithm. For example, suppose we obtain the following association rule in DB1:

TUP1 ⇒ YDR343C

Here TUP1 and YDR343C are gene names. With the proposed FIS-tree mining algorithm, this rule means that when TUP1's expression level increases, YDR343C's expression level also increases. However, for the same rule, if we switch bit 1 with bit 0 when performing the ANDing operation on the basic BSC-tree of TUP1 with YDR343C's basic BSC-tree, the rule means that when TUP1's expression level decreases, YDR343C's expression level increases.
That is because, before switching, bit 1 indicates that TUP1's expression level increases and bit 0 indicates that it decreases; after switching, bit 1 indicates that TUP1's expression level decreases and bit 0 indicates that it increases.
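This bit switch can be sketched on the same flat bit-vector view used earlier: complementing TUP1's vector before the AND makes a 1 mean "TUP1's expression level decreases", so the counted co-occurrences become decrease/increase pairs. The data below are hypothetical, and the code illustrates the idea rather than the BSC-tree operation itself; confidence is computed in the standard way as support(A and B) / support(A).

# Sketch of mining a "decreases => increases" rule by complementing one
# bit vector before the AND, mirroring the bit-1/bit-0 switch described above.

def complement(bits):
    """Flip each bit: afterwards, bit 1 means the expression level decreases."""
    return [1 - b for b in bits]

def support(bits_a, bits_b):
    return sum(a & b for a, b in zip(bits_a, bits_b)) / len(bits_a)

def confidence(antecedent_bits, consequent_bits):
    both = sum(a & b for a, b in zip(antecedent_bits, consequent_bits))
    return both / sum(antecedent_bits)

tup1 = [1, 0, 0, 1, 0]     # hypothetical: 1 = TUP1's expression level increases
ydr343c = [0, 1, 1, 0, 1]  # hypothetical: 1 = YDR343C's expression level increases

dec_tup1 = complement(tup1)
print(support(dec_tup1, ydr343c))     # 0.6: TUP1 falls while YDR343C rises in 3 of 5
print(confidence(dec_tup1, ydr343c))  # 1.0: every time TUP1 falls, YDR343C rises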

Finally, we will extend our algorithm to perform association rule mining on multiple databases.

References

[1] R. Agrawal, R. Srikant, Fast algorithms for mining association rules, in: Proc. 20th VLDB Conf., 1994.
[2] R. Agrawal, T. Imielinski, A.N. Swami, Mining association rules between sets of items in large databases, in: Proc. ACM SIGMOD Int. Conf. Management of Data, Washington, DC, 1993.
[3] K. Alsabti, S. Ranka, V. Singh, An efficient K-means clustering algorithm, in: Proc. IPPS/SPDP Workshop on High Performance Data Mining.
[4] A. Brazma, J. Vilo, Gene expression data analysis, FEBS Letters 480 (2000).
[5] M. Chee, et al., Accessing genetic information with high-density DNA arrays, Science 274 (1996).
[6] J. DeRisi, et al., Use of a cDNA microarray to analyze gene expression patterns in human cancer, Nature Genet. 14 (1996).
[7] J.L. DeRisi, V.R. Iyer, P.O. Brown, Exploring the metabolic and genetic control of gene expression on a genomic scale, Science 278 (1997).
[8] M. Eisen, P.T. Spellman, D. Botstein, P.O. Brown, Cluster analysis and display of genome-wide expression patterns, in: Proc. National Academy of Sciences, USA, 1998.
[9] J. Fridlyand, S. Dudoit, Comparison of supervised learning methods for the classification of tumors using gene expression data, in: Quantitative Challenges in the Post Genomic Sequence Era: A Workshop and Symposium, San Diego, CA, January.
[10] B. Fritzke, A growing neural gas network learns topologies, in: G. Tesauro, D. Touretzky, T. Leen (Eds.), Advances in Neural Information Processing Systems, vol. 7, MIT Press, Cambridge, MA, 1995.
[11] J. Han, J. Pei, Y. Yin, Mining frequent patterns without candidate generation, in: Proc. ACM SIGMOD Int. Conf. Management of Data, Dallas, Texas, USA, May 2000.
[12] Harvard Medical School, Available from, November.
[13] J. Hipp, U. Güntzer, et al., Mining association rules: deriving a superior algorithm by analyzing today's approaches, in: Proc. 4th Euro. Conf. Principles and Practice of Knowledge Discovery, Lyon, France, 2000.
[14] H.D. Huang, H.L. Chang, T.S. Tsou, B.J. Liu, C.Y. Kao, J.T. Horng, A data mining method to predict transcriptional regulatory sites based on differentially expressed genes in human genome, in: Third IEEE Symposium on BIBE, 2003.
[15] X.R. Jiang, Y.Z. Wu, The project of a method of real-time data compression for bit strings.
[16] R.A. Johnson, D.W. Wichern, Applied Multivariate Statistical Analysis, Prentice Hall, Upper Saddle River, NJ.
[17] N.B. Karayiannis, J.C. Bezdek, An integrated approach to fuzzy learning vector quantization and fuzzy c-means clustering, IEEE Trans. Fuzzy Syst. 5 (4) (1997).
[18] A.D. Keller, M. Schummer, L. Hood, W.L. Ruzzo, Bayesian classification of DNA array expression data, Technical Report UW-CSE, University of Washington, August.
[19] T. Kohonen, The self-organizing map, Proc. IEEE 78 (9) (1990).
[20] B. Lewin, Genes VI, Oxford University Press, 1997.
[21] W. Perrizo, Q. Ding, Q. Ding, A. Roy, Deriving high confidence rules from spatial data using Peano count trees, Springer-Verlag, LNCS 2118, July 2001.
[22] W. Perrizo, Q. Ding, Q. Ding, A. Roy, On mining satellite and other remotely sensed images, in: DMKD, 2001.
[23] J. Pei, J. Han, R. Mao, An efficient algorithm for mining frequent closed itemsets, in: Proc. ACM SIGMOD Int. Conf. Management of Data, Dallas, Texas, USA, May 2000.
[24] H. Samet, The quadtree and related hierarchical data structures, ACM Comput. Surv. 16 (2) (1984).
[25] D. Shalon, S.J. Smith, P.O. Brown, A DNA microarray system for analyzing complex DNA samples using two-color fluorescent probe hybridization, Genome Res. 6 (7) (1996).
[26] R. Sharan, R. Shamir, CLICK: A clustering algorithm with applications to gene expression analysis, in: Proc. 8th Int. Conf. Intelligent Systems for Molecular Biology, 2000.

[27] M. Schena, D. Shalon, R.W. Davis, P.O. Brown, Quantitative monitoring of gene expression patterns with a complementary DNA microarray, Science 270 (1995) 467.
[28] R. Srikant, R. Agrawal, Mining quantitative association rules in large relational tables, in: Proc. ACM SIGMOD, 1996.
[29] Stanford University, Available from, November.
[30] J.C. Venter, M.D. Adams, E.W. Myers, P.W. Li, et al., The sequence of the human genome, Science 291 (2001).

Xiang-Rong Jiang received his Ph.D. degree in Synthetic Organic Chemistry from the Shanghai Institute of Organic Chemistry, Shanghai, China. He is a Senior Research Associate at the College of Pharmacy, University of South Carolina. His current research interests include the design and synthesis of selective estrogen receptor modulators.

Le Gruenwald is a Professor in the School of Computer Science at the University of Oklahoma. She received her Ph.D. in Computer Science from Southern Methodist University. She was a Software Engineer at White River Technologies, a Lecturer in the Computer Science and Engineering Department at Southern Methodist University, and a Member of Technical Staff in the Database Management Group at the Advanced Switching Laboratory of NEC America. Her major research interests include web-enabled databases, mobile databases, real-time main memory databases, multimedia databases, data warehousing, and data mining. She is a member of ACM, SIGMOD, and the IEEE Computer Society.
