Preprocessing data sets for association rules using community detection and clustering: a comparative study


Renan de Padua¹, Exupério Lédo Silva Junior¹, Laís Pessine do Carmo¹, Veronica Oliveira de Carvalho² and Solange Oliveira Rezende¹

¹ Instituto de Ciências Matemáticas e de Computação, USP - Universidade de São Paulo, São Carlos, Brasil
{padua,solange}@icmc.usp.br, {exuperio.silva, lais.carmo}@usp.br
² Instituto de Geociências e Ciências Exatas, UNESP - Univ Estadual Paulista, Rio Claro, Brasil
veronica@rc.unesp.br

Abstract. Association rules are widely used to analyze correlations among items in databases. One of the main drawbacks of association rule mining is that the algorithms usually generate a large number of rules that are not interesting or are already known by the user. Finding new knowledge among the generated rules makes association rule exploration a challenge in itself. One possible solution is to raise the support and confidence values, resulting in the generation of fewer rules. The problem with this approach is that, as the support and confidence values come closer to 100%, the generated rules tend to be formed by dominating items and to be more obvious. Some research has been carried out on the use of clustering algorithms to prepare the database before extracting the association rules, so that the grouped data can account for items that appear in only part of the database. However, even with the good results that clustering methods have shown, they rely only on similarity (or distance) measures, which limits the grouping. In this paper, we evaluate the use of community detection algorithms to preprocess databases for association rules. We compare the community detection algorithms with two clustering methods, aiming to analyze the novelty of the generated patterns. The results show that community detection algorithms perform well regarding the novelty of the generated patterns.

1. Introduction

Association rules are widely used to extract correlations among items in a given database due to their simplicity. The rules have the pattern LHS → RHS, where LHS is the left-hand side (rule antecedent) and RHS is the right-hand side (rule consequent). LHS → RHS occurs with a probability of c%, the confidence of the rule, which is used to generate the association rules along with the support [1]. Association rule extraction is generally done in two steps: (i) frequent itemset mining; (ii) pattern extraction. Step (i) is done using the support, which measures the fraction of transactions in which an itemset occurs (an itemset is a set of items that occurs in the database). For example, given an itemset {a, b, c}, its support is the number of transactions that contain all three items divided by the total number of transactions. Step (ii) consists of combining all the frequent itemsets and calculating the confidence of each possible rule.
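Both quantities are simple to compute directly from a transaction list. The following minimal sketch (Python, toy transactions; the helper names are illustrative, not from the paper) makes the two definitions concrete:

```python
# Toy transaction database: each transaction is a set of items.
transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"b", "c"},
    {"a", "c"},
    {"a", "b", "c"},
]

def support(itemset, db):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in db) / len(db)

def confidence(lhs, rhs, db):
    """Support of the whole rule divided by the support of its antecedent."""
    return support(set(lhs) | set(rhs), db) / support(lhs, db)

print(support({"a", "b", "c"}, transactions))  # 2/5 = 0.4
print(confidence({"a"}, {"b"}, transactions))  # 0.6 / 0.8 = 0.75
```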

The confidence measures the quality of the rule, i.e., the probability of c% that the RHS will occur provided that the LHS occurred. The confidence is calculated by dividing the support of the rule by the support of the LHS.

This process has two main drawbacks: (a) items that are not dominant in the database (that have low support) but have interesting correlations are usually not found. This is the local knowledge: interesting knowledge that represents a portion of the database but is not very frequent; (b) the number of generated rules that are not interesting normally surpasses the user's ability to explore them.

Aiming to solve the first drawback (a), studies have been done on preprocessing the database before submitting it to association rule extraction. One way of doing this is to cluster the data used to generate the association rules, aiming to extract, inside each group, the local knowledge that would not be extracted from the original database [8, 9]. These clustering approaches obtained interesting results, since they can (i) extract the local knowledge from databases and (ii) organize the domain. However, traditional clustering methods rely only on similarity (or distance) measures, which can disregard important characteristics of the database. One area growing in the literature is community detection, which groups data modeled as a complex network. Community detection algorithms are capable of finding structural communities in the data using networks [12]; so far they have been used only to group the data, not to support association rule mining, and this structural characteristic may improve it [17, 13].

Based on the above, this paper analyzes some community detection algorithms in the association rule mining context, comparing the results with traditional clustering methods. The results show that community detection algorithms provide a different analysis of the data, exploring a different part of the knowledge and generating new and potentially interesting associations.

The paper is organized as follows. Section 2 describes the related research. Section 3 presents the assessment methodology used. Section 4 presents the experiment configurations. Section 5 discusses the results obtained in the experiments. Finally, conclusions and future work are given in Section 6.

2. Background

The preprocessing area aims to reduce the number of items to be analyzed by the association rule extraction algorithm, in order to consider the items that compose the local knowledge of the database. In the literature, preprocessing research normally uses clustering algorithms to find the local knowledge, putting similar knowledge together. This way, items that have low support will be considered by the association rule extraction algorithms, and the dominating items will be split across the groups and become less dominant.

In [1] a clustering algorithm called CLASD (CLustering for ASsociation Discovery) was proposed to cluster the transactions of a database. The authors proposed a measure to calculate the similarity between transactions considering the items they have. The measure is shown in Equation 1, where CountT(·) returns the number of transactions that contain the items inside the parentheses, T_x ∩ T_y is the set of items shared by both transactions, and T_x ∪ T_y is the set of items that appear in at least one of the two transactions. The algorithm is hierarchical, using a bottom-up strategy: each transaction starts as a unitary group, and groups are merged until a parameter k, which represents the desired number of groups, is reached.

The merge between two groups is done using complete linkage. The results showed that the proposed approach was capable of finding more frequent itemsets than a random partition and than the original data, creating new rules to be explored that would not be extracted in the other cases.

$Sim(T_x, T_y) = \frac{CountT(T_x \cap T_y)}{CountT(T_x \cup T_y)}$   (1)

In [14] a different approach is proposed. It still clusters the data but, instead of clustering the transactions, it clusters the items of the database. The study calculates the similarity between items based on the transactions that contain them, as shown in Equation 2, where Trans(I_x) returns all the transactions that contain the item I_x, and ∩ and ∪ have the same meaning as in Equation 1. Besides this measure, the authors used three other measures to calculate item similarity and applied several clustering algorithms over the database. The results demonstrated that the proposed approach performs well on sparse data, where the general support is extremely low, extracting rules that would not be generated if the association rule extraction algorithm were applied over the non-clustered data set.

$Sim(I_x, I_y) = \frac{CountT(Trans(I_x) \cap Trans(I_y))}{CountT(Trans(I_x) \cup Trans(I_y))}$   (2)

Both approaches aimed to find rules that would not be generated from the original database. However, the analysis performed by the authors considers only the number of new rules (or new itemsets) that were mined. They do not analyze the results by measuring, among other aspects, the ratio of new rules to the number of maintained rules, which would verify the quality of the preprocessing algorithms (see Section 3). More works using clustering can be seen in [8] and [9].

Besides these approaches, some works use community detection algorithms to preprocess the data set. Community detection algorithms search for the natural groups that occur in a network, regardless of their number and sizes, and are a primary tool for discovering and understanding the large-scale structure of networks [12]. The main goal of community detection algorithms is to split the data into groups according to the network structure. The main difference from clustering methods is that community detection algorithms do not use only the similarity among the items to do the split, but also use the network structure to find the groups [12].

In [17] the authors used a structure called a Product Network: each vertex of the network represents an item, and the edge (link) between two vertices represents the number of transactions in which the items occur together. First, the authors created the Product Network and applied a filter, reducing the number of connections among different products. Then, a community detection algorithm was applied to group the products. The authors obtained a large number of communities, around 28+ on each data set they used, each one containing a mean of 7 items per group. The entire discussion concerns the amount of knowledge to be explored in each group, making the exploration and understanding of each group easier for the user.
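Both measures are intersection-over-union counts; Equation 2, for instance, is the Jaccard coefficient of the two items' transaction sets. A minimal sketch (Python, toy data; representing the database as a list of item sets is an assumption of this example):

```python
# Toy database: transaction index -> set of items.
transactions = [{"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]

def trans(item, db):
    """Trans(I_x): indices of the transactions that contain `item`."""
    return {i for i, t in enumerate(db) if item in t}

def item_sim(ix, iy, db):
    """Equation 2: |Trans(I_x) & Trans(I_y)| / |Trans(I_x) | Trans(I_y)|."""
    tx, ty = trans(ix, db), trans(iy, db)
    union = tx | ty
    return len(tx & ty) / len(union) if union else 0.0

print(item_sim("a", "b", transactions))  # shared {0, 3}, union {0, 1, 2, 3} -> 0.5
```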

More works using community detection algorithms can be seen in [13] and [3].

3. Assessment Methodology

The approaches available in the literature have presented good results, as described in Section 2, but their analyses do not consider some important points. Moreover, traditional clustering methods rely only on similarity (or distance) measures, which can disregard important characteristics of the database. The use of complex networks to find groups can bring important features into the grouping process, such as the position of the elements and the data density.

Figure 1. The assessment methodology.

Based on this, this paper proposes an analysis of community detection algorithms, which use the network structure to find groups, to aid the association rule mining process. The proposed analysis is illustrated in Figure 1. It consists of comparing the results obtained by traditional clustering methods with the results obtained by community detection algorithms. For that, some metrics proposed in [6] are used (described below). To compute the metrics, two processes are executed: the traditional one (A) and the one that uses community detection (B). First, the original association rule set (AR) is mined from each database. This set is used to compute the evaluation metrics (see below). Then, for each database, the similarity among all transactions is calculated. After that, in (A), a traditional clustering algorithm is applied; inside each obtained group, an AR extraction algorithm is executed, generating groups of rules. All of these rules form the clustered AR set (AR_cl). In (B), based on the computed similarities, the data is modeled as a network: vertices are transactions and edges carry the computed similarity between them. Then, a community detection algorithm is applied; inside each obtained group, an AR extraction algorithm is executed, generating groups of rules. All of these rules form the community detection AR set (AR_cd). Considering the obtained AR sets, the evaluation metrics are computed and the comparison is done; a schematic sketch of this shared pipeline is given below.
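Processes (A) and (B) differ only in how the transactions are grouped. The sketch below (Python, with hypothetical `grouper` and `mine_rules` callables; this is a schematic of Figure 1, not the paper's code) makes the shared data flow explicit:

```python
def grouped_rules(transactions, grouper, mine_rules):
    """Shared pipeline of Figure 1: split the transactions into groups,
    mine association rules inside each group, and pool the results."""
    pooled = set()
    for group in grouper(transactions):  # clustering (A) or community detection (B)
        pooled |= mine_rules(group)      # per-group AR extraction
    return pooled

# AR    = mine_rules(transactions)                         -> original rule set
# AR_cl = grouped_rules(transactions, cluster, mine_rules)
# AR_cd = grouped_rules(transactions, detect_communities, mine_rules)
# The evaluation metrics then compare AR against AR_cl and AR_cd.
```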

Regarding the evaluation metrics, we used some of the metrics proposed in [6]. Some of these metrics evaluate the benefits obtained from the data grouping, which means that they need to consider both the original association rule set (AR) and an association rule set generated after the grouping (AR_cl or AR_cd). Besides, the authors use the concept of h-top best rules, which consists of selecting the h% best rules, based on an objective measure, from the AR set, and the h% best rules from the grouped set under analysis (AR_cl or AR_cd). These are the rules considered the most interesting to the user according to the chosen objective measure. In this work, h was set to 1% of the total number of rules and the objective measure used was Lift. For details about objective measures see [7]. The metrics selected for this work are:

- MO-RSP: ratio of the rules in the AR set that were kept in AR_cl or AR_cd. The aim is to analyze the amount of knowledge that was maintained; the higher the value, the better the result.
- MR-O-RSP: ratio of the rules in the AR set that were generated more than once in AR_cl or AR_cd (the same rule can be extracted in different groups). The aim is to analyze the amount of knowledge that was repeatedly generated in different groups; the lower the value, the better the result.
- MN-RSP: ratio of new rules in AR_cl or AR_cd. A rule is new if it is not in the AR set. The aim is to analyze the number of rules generated in AR_cl or AR_cd that were not generated in AR; the higher the value, the better the result.
- MR-N-RSP: ratio of new rules that were generated more than once in AR_cl or AR_cd (the same rule can be extracted in different groups). The aim is to analyze the repetition of rules in AR_cl or AR_cd; the lower the value, the better the result.
- MN-I-RSP: ratio of new rules generated in AR_cl or AR_cd that are among the h-top best rules of that set (AR_cl or AR_cd). The aim is to analyze the number of new rules that are considered interesting; the higher the value, the better the result.
- MO-I-N-RSP: ratio of rules among the h-top best rules of the AR set that are not contained in AR_cl or AR_cd. The aim is to analyze the number of interesting rules in AR that were lost in AR_cl or AR_cd; the lower the value, the better the result.
- MC-I: ratio of rules among both the h-top best rules in AR and the h-top best rules in AR_cl or AR_cd. The aim is to analyze the number of rules considered interesting both in AR and in AR_cl or AR_cd; for this measure, no consensus was reached by the specialists.
- MNC-I-RSP: ratio of clusters that contain all the h-top best rules in AR_cl or AR_cd. The aim is to analyze the percentage of groups that need to be explored in order to find all the h-top best rules; the lower the value, the better the result.

It is important to highlight that, even though the metric descriptions always cite the grouped sets together ("AR_cl or AR_cd"), the measures are applied to only one grouped set at a time. This means that if we have 4 different algorithms, the metrics will be calculated for each of these algorithms separately. The two sets are always cited together to strengthen the notion that the metrics were calculated for all the grouped data sets.
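Once the rule sets are materialized, several of these metrics reduce to plain set operations. A minimal sketch of MO-RSP and MN-RSP (Python; the rule representation is an assumption of this example, and the repetition- and h-top-based metrics of [6] are not shown):

```python
def mo_rsp(ar, ar_grouped):
    """MO-RSP: fraction of the original rules kept in the grouped set."""
    return len(ar & ar_grouped) / len(ar)

def mn_rsp(ar, ar_grouped):
    """MN-RSP: fraction of the grouped set made of rules absent from AR."""
    return len(ar_grouped - ar) / len(ar_grouped)

# Rules represented as hashable (LHS, RHS) pairs of frozensets.
ar    = {(frozenset({"a"}), frozenset({"b"})),
         (frozenset({"b"}), frozenset({"c"}))}
ar_cd = {(frozenset({"a"}), frozenset({"b"})),
         (frozenset({"c"}), frozenset({"d"}))}

print(mo_rsp(ar, ar_cd), mn_rsp(ar, ar_cd))  # 0.5 0.5
```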

4. Experimental Setup

The assessment methodology was applied to six databases: balance-scale (bs), breast-cancer (bc), car, dermatology (der), tic-tac-toe (ttt) and zoo, all available at the UCI repository. These databases were processed and converted to an attribute/value format. The modified databases can be downloaded from icmc.usp.br/padua/baseseniac2016.zip.

To generate the association rules, the apriori algorithm [2] was used, in the implementation available at Christian Borgelt's homepage. To obtain the AR set, the minimum support and the minimum confidence were empirically defined for each data set, aiming to extract a bounded number of rules (due to the sensitivity of the tic-tac-toe data set, a greater number of rules was generated for it). These values can be seen in Table 1. The second column gives the number of transactions, namely the number of available examples. The third column gives the number of attributes of each data set. It is important to note that, after the conversion to the attribute/value format, each attribute was divided into N attributes, N being the number of possible values the attribute had. For example, if a data set contains the attribute color with 3 possible values (blue, yellow and red), then the processed data set will have 3 attributes: color=blue, color=yellow and color=red. The fourth column gives the minimum support and minimum confidence used. The last column gives the number of generated rules.

Table 1. Databases details.

Data set             #Transac   #Attrib   Sup/Conf   #Rules
balance-scale (bs)                         %/5%       1056
breast-cancer (bc)                         %/15%      2220
car                                        %/5%        758
dermatology (der)                          %/75%      2687
tic-tac-toe (ttt)                          %/10%      4116
zoo                                        %/50%      1637

To split the transactions into groups, it is necessary to compute a similarity measure over all transactions. The similarity measure used was Jaccard, presented in Equation 3, where Items(T_x) returns all the items contained in transaction T_x and #(Y) returns the number of items contained in Y. The dissimilarity from T_x to T_y is computed as 1 − Jaccard(T_x, T_y).

$Jaccard(T_x, T_y) = \frac{\#(Items(T_x) \cap Items(T_y))}{\#(Items(T_x) \cup Items(T_y))}$   (3)

Regarding process (A) in Figure 1, two clustering algorithms were used: Partitioning Around Medoids (PAM) and Ward [10], both available in R. We selected these algorithms because PAM is a partitional clustering method, which divides the data set according to a parameter K, while Ward is a hierarchical method, which creates a dendrogram that can be cut at different heights. In this paper, the number of generated groups was set according to the number of groups obtained from the community detection algorithms (it is important to use the same number of groups to compute the evaluation metrics).
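The paper runs PAM and Ward in R. As a rough illustration of process (A), the sketch below computes the Jaccard dissimilarity of Equation 3 and cuts a Ward dendrogram into k groups, assuming SciPy is available (PAM is omitted here, since SciPy does not provide it; this is an approximation, not the paper's setup):

```python
from scipy.cluster.hierarchy import fcluster, linkage

transactions = [{"a", "b"}, {"a", "b", "c"}, {"c", "d"}, {"d", "e"}]

def jaccard_dissim(tx, ty):
    """1 - Jaccard(T_x, T_y), per Equation 3."""
    return 1.0 - len(tx & ty) / len(tx | ty)

# Condensed (upper-triangle) distance vector, as linkage() expects.
n = len(transactions)
dist = [jaccard_dissim(transactions[i], transactions[j])
        for i in range(n) for j in range(i + 1, n)]

tree = linkage(dist, method="ward")                  # Ward dendrogram
labels = fcluster(tree, t=2, criterion="maxclust")   # cut into k = 2 groups
print(labels)                                        # group label per transaction
```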

Regarding process (B) in Figure 1, which uses community detection algorithms, the data were modeled as a simple homogeneous network, each node being a transaction and the edge weight between two nodes being the similarity between them (a similarity equal to 0 means no connection between the transactions). Four different community detection algorithms were used (all of them available in igraph): Modularity-based [11], Leading Eigenvector [12], Spinglass [16] and Walktrap [15]. These community detection algorithms were selected due to their variety of characteristics. The modularity-based algorithm does not need a parameter definition. The Leading Eigenvector algorithm needs the number of clusters; however, igraph has a default configuration, which was used. The Walktrap algorithm has the number of steps as a parameter, which was set to 4 according to [15]. The Spinglass parameters were defined according to [16]. The Modularity and Spinglass algorithms start from a random division of the vertices into different groups and traverse the entire network, changing vertex groups and recalculating the measure values. The Walktrap algorithm starts with every vertex in its own group and merges the closest groups at each iteration. A general overview of these algorithms is given below:

- Modularity-based [11]: this community detection algorithm uses the modularity measure, which penalizes connections among vertices in different communities and rewards connections among vertices in the same community.
- Leading Eigenvector [12]: this algorithm uses the concepts of eigenvector and eigenvalue, calculated over the similarity matrix, to group the data.
- Spinglass [16]: this algorithm uses a measure divided into four parts. Two of them reward connections among vertices in the same community and the lack of connections among vertices in different communities. The other two penalize connections among vertices in different communities and the lack of connections among vertices in the same community.
- Walktrap [15]: this algorithm uses Ward's method, which merges two communities at each step considering the squared distance between them.

Finally, to define the minimum support and minimum confidence used inside the groups (as presented in Figure 1), the following strategy was used: the largest group, i.e., the one containing the highest number of transactions, was selected, and the minimum support and minimum confidence were empirically defined aiming to extract no more than 1000 rules. The obtained values were then applied to all the other groups. This process was done for each data set. Table 2 presents the number of groups and the minimum support and minimum confidence used. The table presents the configurations only for the Spinglass community detection algorithm, since it obtained the best results among the community detection algorithms. All the results, covering all the configurations, can be seen at icmc.usp.br/padua/resultadoseniac2016.xlsx. It is important to note that for the dermatology data set the support and confidence used in the Ward clustered set had to be lowered to 90%/90%, since no rule was generated using 99%/99%. Likewise, for the zoo data set the values were lowered to 40%/50% for Ward's algorithm for the same reason.

Table 2. Support and Confidence values used on the grouped data.

Data set             # of groups   Sup/Conf
balance-scale (bs)   3             5%/20%
breast-cancer (bc)   3             25%/50%
car                  9             10%/50%
dermatology (der)    3             99%/99%
tic-tac-toe (ttt)    9             20%/50%
zoo                  3             90%/90%
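Returning to the network construction of process (B): the paper uses igraph's implementations; a comparable sketch with python-igraph (the toy similarity matrix and graph construction are assumptions of this example) builds the transaction network and runs Spinglass:

```python
import igraph as ig

# Toy precomputed Jaccard similarities between transactions (symmetric).
sim = [[1.0, 0.6, 0.0, 0.0],
       [0.6, 1.0, 0.2, 0.0],
       [0.0, 0.2, 1.0, 0.5],
       [0.0, 0.0, 0.5, 1.0]]

n = len(sim)
edges, weights = [], []
for i in range(n):
    for j in range(i + 1, n):
        if sim[i][j] > 0:          # similarity 0 => no edge between transactions
            edges.append((i, j))
            weights.append(sim[i][j])

g = ig.Graph(n=n, edges=edges)
g.es["weight"] = weights

# Spinglass requires a connected graph; this toy network is connected.
communities = g.community_spinglass(weights="weight")
print(communities.membership)      # community label for each transaction
```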

5. Results and Discussion

The values obtained for each of the metrics are presented in Tables 3 to 6. Table 3 presents the results regarding MO-RSP and MR-O-RSP. The first column presents the database name (see Table 1). From the second to the fourth column, the MO-RSP values are presented for the Spinglass (SP), PAM and Ward (WD) algorithms, respectively. From the fifth to the seventh column, the MR-O-RSP values are presented for the same algorithms. Only the results regarding Spinglass are presented, since it obtained the best results among the community detection algorithms. All the other tables (Tables 4, 5 and 6) follow the same pattern. The best results in each data set are highlighted in gray.

As mentioned, Table 3 presents the results obtained by MO-RSP and MR-O-RSP. The first metric analyzes the amount of knowledge maintained from the AR set and the second the amount of this knowledge that is repeated. It can be seen that Spinglass presented a better performance on 3 of the 6 data sets for MO-RSP, while PAM performed better on 2 and Ward on 1. The value obtained for the second metric, 0% of repetition, was the same for all algorithms and data sets, which is a very interesting result, showing that no rule is repeated in AR_cd or AR_cl. Analyzing these measures together, it can be seen that the Spinglass algorithm is capable of generating new knowledge while maintaining part of the knowledge of the AR set. The PAM algorithm presents a similar behavior; however, it was not as effective as Spinglass. The Ward algorithm won on the zoo data set because the support and confidence had to be lowered to the same values used for the AR set, due to the lack of generated rules.

Table 3. Results obtained by MO-RSP and MR-O-RSP metrics.

       MO-RSP                      MR-O-RSP
Base   SP       PAM      WD        SP      PAM     WD
bs     64.39%   61.74%   9.19%     0.00%   0.00%   0.00%
bc     59.23%   55.32%   11.80%    0.00%   0.00%   0.00%
car    89.18%   99.47%   42.08%    0.00%   0.00%   0.00%
der    19.09%   7.03%    0.33%     0.00%   0.00%   0.00%
ttt    43.33%   60.42%   2.04%     0.00%   0.00%   0.00%
zoo    32.25%   31.77%   100%      0.00%   0.00%   0.00%

Table 4 presents the results obtained by MN-RSP and MR-N-RSP. The first metric analyzes the ratio of new knowledge generated, and the second the ratio of new knowledge repeated across different groups. On MN-RSP there is a tie between Spinglass and PAM, each winning on 3 data sets. However, on the tic-tac-toe data set the amount of new knowledge generated by the Spinglass algorithm is almost 40% more than the amount generated by PAM, whereas the biggest difference in the cases where PAM won is about 10%.

On the second metric a tie also occurred: both PAM and SP had the best results on 4 data sets. However, a closer look at the results shows that Spinglass is more stable than PAM, as seen on the balance-scale data set: PAM obtained 100% in one case, which is the worst possible value for this metric, while the worst result obtained by SP was 11.97%.

Table 4. Results obtained by MN-RSP and MR-N-RSP metrics.

       MN-RSP                      MR-N-RSP
Base   SP       PAM      WD        SP       PAM     WD
bs     10.29%   0.00%    0.00%     0.00%    100%    100%
bc     4.99%    14.84%   0.00%     0.00%    0.00%   100%
car    76.83%   76.82%   0.93%     10.99%   8.42%   0.00%
der    76.07%   85.88%   0.00%     0.00%    0.00%   100%
ttt    87.89%   48.88%   0.00%     11.97%   0.08%   100%
zoo    84.85%   86.62%   47.18%    0.00%    0.00%   4.32%

Table 5 presents the results obtained for MN-I-RSP and MO-I-N-RSP. The first metric analyzes the ratio of new rules in AR_cl or AR_cd that are among the h-top interesting rules of their sets, and the second the ratio of rules among the h-top rules of the AR set that are not found in AR_cl or AR_cd. Regarding the first metric, Spinglass won on most of the data sets, meaning that it generated more interesting new rules than the other algorithms. Even in the cases where it lost (car and zoo), the gap is no more than 1%, while in the cases where it won, the difference reaches almost 7%. On the second metric all the algorithms performed badly, with only 3 values different from 100% (2 for PAM and 1 for Ward). This means that, in almost all cases, the rules among the h-top best rules of the AR set were not found in AR_cl or AR_cd. Analyzing these two metrics together, it is possible to see that Spinglass brought more novelty to the generated knowledge than the others, while not maintaining what was previously considered interesting (100% on all data sets). The other algorithms, on the other hand, generated less new interesting knowledge but, in some cases, maintained some of the rules considered interesting in the AR set.

Table 5. Results obtained by MN-I-RSP and MO-I-N-RSP metrics.

       MN-I-RSP                    MO-I-N-RSP
Base   SP       PAM      WD        SP      PAM      WD
bs     87.5%    85.71%   66.67%    100%    100%     100%
bc     93.75%   62.50%   85.71%    100%    77.27%   100%
car    97.44%   98.00%   35.71%    100%    100%     57.14%
der    95.65%   93.33%   0.00%     100%    100%     100%
ttt    99.44%   92.98%   66.67%    100%    92.68%   100%
zoo    97.22%   97.44%   98.25%    100%    100%     100%

Table 6 presents the results obtained for MC-I and MNC-I-RSP. The first metric analyzes the ratio of rules contained in both h-top rule sets, that is, the rules that are among the h-top best of the AR set and also among the h-top best of AR_cl or AR_cd. The second metric analyzes the percentage of clusters that need to be explored in order to find all the h-top rules in AR_cl or AR_cd.

For the first metric no cell was highlighted, because the specialists did not reach a consensus in [6] regarding the interpretation of its values. A 0% value means that none of the rules selected as the h-top best in AR_cl or AR_cd were selected as the h-top best in AR; that is, all the interesting knowledge in AR_cl or AR_cd is new, directing the user to explore an entirely new set of interesting rules compared to the AR set. On the second metric the Spinglass algorithm won again, concentrating all the interesting rules in fewer groups than the others: Spinglass won on 4 data sets, PAM on 2 (1 tie with SP), and the zoo data set had a tie among all 3 algorithms.

Table 6. Results obtained by MC-I and MNC-I-RSP metrics.

       MC-I                       MNC-I-RSP
Base   SP      PAM      WD        SP       PAM      WD
bs     0.00%   0.00%    0.00%     33.33%   66.67%   100%
bc     0.00%   22.73%   0.00%     100%     66.67%   100%
car    0.00%   0.00%    28.57%    11.11%   33.33%   88.89%
der    0.00%   0.00%    0.00%     33.33%   33.33%   100%
ttt    0.00%   7.32%    0.00%     11.11%   77.78%   100%
zoo    0.00%   0.00%    0.00%     33.33%   33.33%   33.33%

On the metrics MN-RSP, MN-I-RSP and MO-I-N-RSP, which analyze the new knowledge, the Spinglass algorithm performed better than PAM and Ward. Together, these measures indicate that the Spinglass algorithm can be used when the user wants to obtain new knowledge, without the need to maintain what was considered interesting in the non-clustered data set. On the metrics MO-RSP and MC-I, which analyze the maintained knowledge, the PAM algorithm obtained higher results. This means that the PAM algorithm was capable of finding new knowledge while keeping part of the interesting knowledge of the original data set alongside the newly obtained interesting results. In short, the results indicate that the Spinglass algorithm can generate more new knowledge, while PAM and Ward are better at maintaining the knowledge that was previously considered interesting.

6. Conclusion

This paper presented a comparative study between two traditional clustering algorithms and four community detection algorithms in the context of association rule preprocessing, using six data sets. The study used 8 metrics, all proposed in [6], to analyze the results. We presented and compared the results obtained by the Spinglass algorithm against the two clustering algorithms, because Spinglass obtained the best results among the four community detection algorithms selected. The complete results, covering all six algorithms, can be seen at icmc.usp.br/padua/resultadoseniac2016.xlsx.

The results demonstrated that PAM and Spinglass had very similar performance, both obtaining good results. However, the Spinglass algorithm performed better on the metrics that analyzed the amount of new knowledge generated, indicating that it can generate more novelty than the clustering methods. On the metrics that analyzed the amount of maintained knowledge, PAM performed better.

This indicates that PAM is better at maintaining the knowledge that would be generated in the AR set. Ward did not show good results. Therefore, the results indicate that Spinglass has the potential to discover new rules that bring novelty to the user. Moreover, this algorithm was able to concentrate all the interesting rules (the rules among the h-top best) in a small number of clusters, indicating that its partitioning is more concise.

This initial exploration shows interesting results. However, there are many improvements to be made. As seen, there is a need for a community detection algorithm designed specifically for the context of association rule preprocessing. Such an algorithm must consider the intrinsic characteristics of the context, such as the high density, and must be capable of splitting the domain in a way that makes the interesting knowledge appear together. Besides, a deeper study needs to be carried out, considering more data sets and more algorithms, to better analyze the results obtained by each type of grouping.

Acknowledgment

We wish to thank CAPES and FAPESP (Grant 2014/, São Paulo Research Foundation) for the financial aid.

References

[1] Aggarwal, C., Procopiuc, C., and Yu, P. (2002). Finding localized associations in market basket data. IEEE Transactions on Knowledge and Data Engineering, 14(1).
[2] Agrawal, R., Imielinski, T., and Swami, A. (1994). Mining association rules between sets of items in large databases. Special Interest Group on Management of Data.
[3] Alonso, A. G., Carrasco-Ochoa, J. A., Medina-Pagola, J. E., and Trinidad, J. F. M. (2011). Reducing the number of canonical form tests for frequent subgraph mining. Computación y Sistemas.
[4] Berrado, A. and Runger, G. C. (2007). Using metarules to organize and group discovered association rules. Data Mining and Knowledge Discovery, 14(3).
[5] Carvalho, V. O., dos Santos, F. F., Rezende, S. O., and de Padua, R. (2011). PAR-COM: A new methodology for post-processing association rules. In Proceedings of ICEIS, Springer.
[6] Carvalho, V. O., dos Santos, F. F., and Rezende, S. O. (2015). Metrics for association rule clustering assessment. Transactions on Large-Scale Data- and Knowledge-Centered Systems XVII.
[7] Carvalho, V. O., Padua, R., and Rezende, S. O. (2016). Solving the problem of selecting suitable objective measures by clustering association rules through the measures themselves. SOFSEM 2016.
[8] Koh, Y. S. and Pears, R. (2008). Rare association rule mining via transaction clustering. In Proceedings of the Seventh Australasian Data Mining Conference.
[9] Maquee, A., Shojaie, A. A., and Mosaddar, D. (2012). Clustering and association rules in analyzing the efficiency of maintenance system of an urban bus network. International Journal of System Assurance Engineering and Management.

[10] Murtagh, F. and Legendre, P. (2014). Ward's hierarchical agglomerative clustering method: which algorithms implement Ward's criterion? Journal of Classification, 31.
[11] Newman, M. E. J. (2010). Networks: An Introduction. Oxford University Press.
[12] Newman, M. E. J. (2006). Finding community structure in networks using the eigenvectors of matrices. Physical Review E, 74(3).
[13] Özkural, E., Uçar, B., and Aykanat, C. (2011). Parallel frequent item set mining with selective item replication. IEEE Transactions on Parallel and Distributed Systems.
[14] Plasse, M., Niang, N., Saporta, G., Villeminot, A., and Leblond, L. (2007). Combined use of association rules mining and clustering methods to find relevant links between binary rare attributes in a large data set. Computational Statistics & Data Analysis, 52(1).
[15] Pons, P. and Latapy, M. (2005). Computing communities in large networks using random walks (long version). Computer and Information Sciences - ISCIS 2005.
[16] Reichardt, J. and Bornholdt, S. (2006). Statistical mechanics of community detection. Physical Review E, 74.
[17] Videla-Cavieres, I. F. and Ríos, S. A. (2014). Extending market basket analysis with graph mining techniques: A real case. Expert Systems with Applications.


More information

The Application of K-medoids and PAM to the Clustering of Rules

The Application of K-medoids and PAM to the Clustering of Rules The Application of K-medoids and PAM to the Clustering of Rules A. P. Reynolds, G. Richards, and V. J. Rayward-Smith School of Computing Sciences, University of East Anglia, Norwich Abstract. Earlier research

More information

A k-means Clustering Algorithm on Numeric Data

A k-means Clustering Algorithm on Numeric Data Volume 117 No. 7 2017, 157-164 ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version) url: http://www.ijpam.eu ijpam.eu A k-means Clustering Algorithm on Numeric Data P.Praveen 1 B.Rama 2

More information

Clustering in Data Mining

Clustering in Data Mining Clustering in Data Mining Classification Vs Clustering When the distribution is based on a single parameter and that parameter is known for each object, it is called classification. E.g. Children, young,

More information

KNIME TUTORIAL. Anna Monreale KDD-Lab, University of Pisa

KNIME TUTORIAL. Anna Monreale KDD-Lab, University of Pisa KNIME TUTORIAL Anna Monreale KDD-Lab, University of Pisa Email: annam@di.unipi.it Outline Introduction on KNIME KNIME components Exercise: Data Understanding Exercise: Market Basket Analysis Exercise:

More information

Lecture 6: Unsupervised Machine Learning Dagmar Gromann International Center For Computational Logic

Lecture 6: Unsupervised Machine Learning Dagmar Gromann International Center For Computational Logic SEMANTIC COMPUTING Lecture 6: Unsupervised Machine Learning Dagmar Gromann International Center For Computational Logic TU Dresden, 23 November 2018 Overview Unsupervised Machine Learning overview Association

More information

Hierarchical Clustering

Hierarchical Clustering What is clustering Partitioning of a data set into subsets. A cluster is a group of relatively homogeneous cases or observations Hierarchical Clustering Mikhail Dozmorov Fall 2016 2/61 What is clustering

More information

CMPUT 391 Database Management Systems. Data Mining. Textbook: Chapter (without 17.10)

CMPUT 391 Database Management Systems. Data Mining. Textbook: Chapter (without 17.10) CMPUT 391 Database Management Systems Data Mining Textbook: Chapter 17.7-17.11 (without 17.10) University of Alberta 1 Overview Motivation KDD and Data Mining Association Rules Clustering Classification

More information

Challenges and Interesting Research Directions in Associative Classification

Challenges and Interesting Research Directions in Associative Classification Challenges and Interesting Research Directions in Associative Classification Fadi Thabtah Department of Management Information Systems Philadelphia University Amman, Jordan Email: FFayez@philadelphia.edu.jo

More information

An ICA-Based Multivariate Discretization Algorithm

An ICA-Based Multivariate Discretization Algorithm An ICA-Based Multivariate Discretization Algorithm Ye Kang 1,2, Shanshan Wang 1,2, Xiaoyan Liu 1, Hokyin Lai 1, Huaiqing Wang 1, and Baiqi Miao 2 1 Department of Information Systems, City University of

More information

PAM algorithm. Types of Data in Cluster Analysis. A Categorization of Major Clustering Methods. Partitioning i Methods. Hierarchical Methods

PAM algorithm. Types of Data in Cluster Analysis. A Categorization of Major Clustering Methods. Partitioning i Methods. Hierarchical Methods Whatis Cluster Analysis? Clustering Types of Data in Cluster Analysis Clustering part II A Categorization of Major Clustering Methods Partitioning i Methods Hierarchical Methods Partitioning i i Algorithms:

More information

THE AREA UNDER THE ROC CURVE AS A CRITERION FOR CLUSTERING EVALUATION

THE AREA UNDER THE ROC CURVE AS A CRITERION FOR CLUSTERING EVALUATION THE AREA UNDER THE ROC CURVE AS A CRITERION FOR CLUSTERING EVALUATION Helena Aidos, Robert P.W. Duin and Ana Fred Instituto de Telecomunicações, Instituto Superior Técnico, Lisbon, Portugal Pattern Recognition

More information

Performance Based Study of Association Rule Algorithms On Voter DB

Performance Based Study of Association Rule Algorithms On Voter DB Performance Based Study of Association Rule Algorithms On Voter DB K.Padmavathi 1, R.Aruna Kirithika 2 1 Department of BCA, St.Joseph s College, Thiruvalluvar University, Cuddalore, Tamil Nadu, India,

More information

Hierarchical Graph Clustering: Quality Metrics & Algorithms

Hierarchical Graph Clustering: Quality Metrics & Algorithms Hierarchical Graph Clustering: Quality Metrics & Algorithms Thomas Bonald Joint work with Bertrand Charpentier, Alexis Galland & Alexandre Hollocou LTCI Data Science seminar March 2019 Motivation Clustering

More information

Multivariate Analysis

Multivariate Analysis Multivariate Analysis Cluster Analysis Prof. Dr. Anselmo E de Oliveira anselmo.quimica.ufg.br anselmo.disciplinas@gmail.com Unsupervised Learning Cluster Analysis Natural grouping Patterns in the data

More information

Machine learning - HT Clustering

Machine learning - HT Clustering Machine learning - HT 2016 10. Clustering Varun Kanade University of Oxford March 4, 2016 Announcements Practical Next Week - No submission Final Exam: Pick up on Monday Material covered next week is not

More information

V4 Matrix algorithms and graph partitioning

V4 Matrix algorithms and graph partitioning V4 Matrix algorithms and graph partitioning - Community detection - Simple modularity maximization - Spectral modularity maximization - Division into more than two groups - Other algorithms for community

More information

Network Traffic Measurements and Analysis

Network Traffic Measurements and Analysis DEIB - Politecnico di Milano Fall, 2017 Introduction Often, we have only a set of features x = x 1, x 2,, x n, but no associated response y. Therefore we are not interested in prediction nor classification,

More information

Bipartite Graph Partitioning and Content-based Image Clustering

Bipartite Graph Partitioning and Content-based Image Clustering Bipartite Graph Partitioning and Content-based Image Clustering Guoping Qiu School of Computer Science The University of Nottingham qiu @ cs.nott.ac.uk Abstract This paper presents a method to model the

More information

MSA220 - Statistical Learning for Big Data

MSA220 - Statistical Learning for Big Data MSA220 - Statistical Learning for Big Data Lecture 13 Rebecka Jörnsten Mathematical Sciences University of Gothenburg and Chalmers University of Technology Clustering Explorative analysis - finding groups

More information

An Enhanced K-Medoid Clustering Algorithm

An Enhanced K-Medoid Clustering Algorithm An Enhanced Clustering Algorithm Archna Kumari Science &Engineering kumara.archana14@gmail.com Pramod S. Nair Science &Engineering, pramodsnair@yahoo.com Sheetal Kumrawat Science &Engineering, sheetal2692@gmail.com

More information

Graph Based Approach for Finding Frequent Itemsets to Discover Association Rules

Graph Based Approach for Finding Frequent Itemsets to Discover Association Rules Graph Based Approach for Finding Frequent Itemsets to Discover Association Rules Manju Department of Computer Engg. CDL Govt. Polytechnic Education Society Nathusari Chopta, Sirsa Abstract The discovery

More information

Supervised and Unsupervised Learning (II)

Supervised and Unsupervised Learning (II) Supervised and Unsupervised Learning (II) Yong Zheng Center for Web Intelligence DePaul University, Chicago IPD 346 - Data Science for Business Program DePaul University, Chicago, USA Intro: Supervised

More information

TELCOM2125: Network Science and Analysis

TELCOM2125: Network Science and Analysis School of Information Sciences University of Pittsburgh TELCOM2125: Network Science and Analysis Konstantinos Pelechrinis Spring 2015 2 Part 4: Dividing Networks into Clusters The problem l Graph partitioning

More information

Medical Data Mining Based on Association Rules

Medical Data Mining Based on Association Rules Medical Data Mining Based on Association Rules Ruijuan Hu Dep of Foundation, PLA University of Foreign Languages, Luoyang 471003, China E-mail: huruijuan01@126.com Abstract Detailed elaborations are presented

More information

Cluster Analysis for Microarray Data

Cluster Analysis for Microarray Data Cluster Analysis for Microarray Data Seventh International Long Oligonucleotide Microarray Workshop Tucson, Arizona January 7-12, 2007 Dan Nettleton IOWA STATE UNIVERSITY 1 Clustering Group objects that

More information